ETAPS Foreword


Welcome to the 24th ETAPS! ETAPS 2021 was originally planned to take place in 
Luxembourg in its beautiful capital Luxembourg City. Because of the Covid-19 pan- 
demic, this was changed to an online event. 

ETAPS 2021 was the 24th instance of the European Joint Conferences on Theory 
and Practice of Software. ETAPS is an annual federated conference established in 
1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from theo- 
retical computer science to foundations of programming languages, analysis tools, and 
formal approaches to software engineering. Organising these conferences in a coherent, 
highly synchronised conference programme enables researchers to participate in an 
exciting event, having the possibility to meet many colleagues working in different 
directions in the field, and to easily attend talks of different conferences. On the 
weekend before the main conference, numerous satellite workshops take place that 
attract many researchers from all over the globe. 

ETAPS 2021 received 260 submissions in total, 115 of which were accepted, 
yielding an overall acceptance rate of 44.2%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con- 
tributions, and in particular the PC (co-)chairs for their hard work in running this entire 
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2021 featured the unifying invited speakers Scott Smolka (Stony Brook 
University) and Jane Hillston (University of Edinburgh) and the conference-specific 
invited speakers Isil Dillig (University of Texas at Austin) for ESOP and Willem Visser 
(Stellenbosch University) for FASE. Invited tutorials were provided by Erika Ábrahám
(RWTH Aachen University) on analysis of hybrid systems and Madhusudan
Parthasarathy (University of Illinois at Urbana-Champaign) on combining machine
learning and formal methods.

ETAPS 2021 was originally supposed to take place in Luxembourg City, Luxembourg,
organized by the SnT - Interdisciplinary Centre for Security, Reliability and
Trust, University of Luxembourg. The University of Luxembourg was founded in 2003.
The university is one of the best and most international young universities with 6,700 
students from 129 countries and 1,331 academics from all over the globe. The local 
organisation team consisted of Peter Y.A. Ryan (general chair), Peter B. Roenne (or- 
ganisation chair), Joaquin Garcia-Alfaro (workshop chair), Magali Martin (event 
manager), David Mestel (publicity chair), and Alfredo Rial (local proceedings chair). 

ETAPS 2021 was further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), and EASST 
(European Association of Software Science and Technology).

The ETAPS Steering Committee consists of an Executive Board, and representatives
of the individual ETAPS conferences, as well as representatives of EATCS,
EAPLS, and EASST. The Executive Board consists of Holger Hermanns
(Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König
(Duisburg), Gerald Lüttgen (Bamberg), Caterina Urban (INRIA), Tarmo Uustalu
(Reykjavik and Tallinn), and Lenore Zuck (Chicago).

Other members of the steering committee are: Patricia Bouyer (Paris), Einar Broch 
Johnsen (Oslo), Dana Fisman (Be’er Sheva), Jan Friso Groote (Eindhoven), Esther 
Guerra (Madrid), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), 
Stefan Kiefer (Oxford), Fabrice Kordon (Paris), Jan Křetínský (Munich), Kim G. 
Larsen (Aalborg), Tiziana Margaria (Limerick), Andrew M. Pitts (Cambridge), Grigore 
Roşu (Illinois), Peter Ryan (Luxembourg), Don Sannella (Edinburgh), Lutz Schröder
(Erlangen), Ilya Sergey (Singapore), Mariëlle Stoelinga (Twente), Gabriele Taentzer
(Marburg), Christine Tasson (Paris), Peter Thiemann (Freiburg), Jan Vitek (Prague), 
Anton Wijs (Eindhoven), Manuel Wimmer (Linz), and Nobuko Yoshida (London). 

I'd like to take this opportunity to thank all the authors, attendees, organizers of the
satellite workshops, and Springer-Verlag GmbH for their support. I hope you all 
enjoyed ETAPS 2021. 

Finally, a big thanks to Peter, Peter, Magali and their local organisation team for all 
their enormous efforts to make ETAPS a fantastic online event. I hope there will be a 
next opportunity to host ETAPS in Luxembourg. 


February 2021 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


TACAS 2021 was the 27th edition of the International Conference on Tools and 
Algorithms for the Construction and Analysis of Systems conference series. TACAS 
2021 was part of the 24th European Joint Conferences on Theory and Practice of 
Software (ETAPS 2021), which, although originally planned to take place in
Luxembourg City, was held as an online event from March 27 to April 1 due to the
COVID-19 pandemic.

TACAS is a forum for researchers, developers, and users interested in rigorously 
based tools and algorithms for the construction and analysis of systems. The conference 
aims to bridge the gaps between different communities with this common interest and 
to support them in their quest to improve the utility, reliability, flexibility, and effi- 
ciency of tools and algorithms for building computer-controlled systems. There were 
four types of submissions for TACAS: 


— Research papers advancing the theoretical foundations for the construction and 
analysis of systems. 

— Case study papers with an emphasis on a real-world setting. 

— Regular tool papers presenting a new tool, a new tool component, or novel 
extensions to an existing tool and requiring an artifact submission. 

— Tool demonstration papers focusing on the usage aspects of tools, also subject to the 
artifact submission requirement. 


This year 141 papers were submitted to TACAS, consisting of 90 research papers, 
29 regular tool papers, 16 tool demo papers, and 6 case study papers. Authors were 
allowed to submit up to four papers. Each paper was reviewed by three Program 
Committee (PC) members, who made extensive use of subreviewers. 

Similarly to previous years, it was possible to submit an artifact alongside a paper, 
which was mandatory for regular tool and tool demo papers. An artifact might consist 
of a tool, models, proofs, or other data required for validation of the results of the 
paper. The Artifact Evaluation Committee (AEC) was tasked with reviewing the 
artifacts, based on their documentation, ease of use, and, most importantly, whether the 
results presented in the corresponding paper could be accurately reproduced. Most 
of the evaluation was carried out using a standardised virtual machine to ensure con- 
sistency of the results, except for those artifacts that had special hardware requirements. 

The evaluation consisted of two rounds. The first round was carried out in parallel 
with the work of the PC. The judgment of the AEC was communicated to the PC and 
weighed in their discussion. The second round took place after paper acceptance 
notifications were sent out; authors of accepted research papers who did not submit an 
artifact in the first round could submit their artifact here. In total, 72 artifacts were 
submitted (63 in the first round and 9 in the second), of which 57 were accepted and 15 
rejected. This corresponds to an acceptance rate of 79 percent. Papers with an accepted 
artifact include a badge on the first page.

Selected authors were requested to provide a rebuttal for both papers and artifacts in
case a review gave rise to questions. In total, 166 rebuttals were provided. Using the
review reports and rebuttals, the Programme and the Artifact Evaluation Committees
extensively discussed the papers and artifacts and ultimately decided to accept 32
research papers, 7 tool papers, 6 tool demos, and 2 case studies.

Besides the regular conference papers, this two-volume proceedings also contains 8 
short papers that describe the participating verification systems and a competition 
report presenting the results of the 10th SV-COMP, the competition on automatic 
software verifiers for C and Java programs. These papers were reviewed by a separate 
program committee (PC); each of the papers was assessed by at least three reviewers. 
A total of 30 verification systems with developers from 11 countries entered the
systematic comparative evaluation, including four submissions from industry. Two
sessions in the TACAS program were reserved for the presentation of the results: (1) a
summary by the competition chair and presentations of the participating tools by the
developer teams in the first session, and (2) an open community meeting in the second
session.


Abstract. Numerous tasks in program analysis and synthesis reduce to
deciding reachability in possibly infinite graphs such as those induced by 
Petri nets. However, the Petri net reachability problem has recently been 
shown to require non-elementary time, which raises questions about the 
practical applicability of Petri nets as target models. In this paper, we 
introduce a novel approach for efficiently semi-deciding the reachability 
problem for Petri nets in practice. Our key insight is that computa- 
tionally lightweight over-approximations of Petri nets can be used as 
distance oracles in classical graph exploration algorithms such as A* and 
greedy best-first search. We provide and evaluate a prototype implemen- 
tation of our approach that outperforms existing state-of-the-art tools, 
sometimes by orders of magnitude, and which is also competitive with 
domain-specific tools on benchmarks coming from program synthesis and 
concurrent program analysis. 


Keywords: Petri nets · reachability · shortest paths · model checking


1 Introduction 


Many problems in program analysis, synthesis and verification reduce to decid- 
ing reachability of a vertex or a set of vertices in infinite graphs, e.g., when 
reasoning about concurrent programs with an unbounded number of threads, 
or when arbitrarily many components can be used in a synthesis task. For au- 
tomated reasoning tasks, those infinite graphs are finitely represented by some 
mathematical model. Finding the right such model requires a trade-off between 
the two conflicting goals of maximal expressive power and computational feasi- 
bility of the relevant decision problems. Petri nets are a ubiquitous mathemati- 
cal model that provides a good compromise between those two goals. They are
expressive enough to find a plethora of applications in computer science, in particular
in the analysis of concurrent processes, yet the reachability problem for
Petri nets is decidable [47,40,41,43]. Counter abstraction has evolved as a generic
abstraction paradigm that reduces a variety of program analysis tasks to problems
in Petri nets or variants thereof such as well-structured transition systems,
see e.g. [30,39,61,5]. Due to their generality and versatility, Petri nets and their
extensions find numerous applications also in other areas, including the design
and analysis of protocols [22], business processes [57], biological systems [33,11]
and chemical systems [2]. The goal of this paper is to introduce and evaluate
an efficient generic approach to deciding the Petri net reachability problem on
instances arising from applications in program verification and synthesis.

the Petri net reachability problem can be obtained from: arxiv.org/abs/2010.07912.

A Petri net comprises a finite set of places with a finite number of transitions.
Places carry a finite yet unbounded number of tokens and transitions can remove
and add tokens to places. A marking specifies how many tokens each place
carries. An example of a Petri net is given on the left-hand side of Figure 1,
where the two places $\{p_1, p_2\}$ are depicted as circles and transitions $\{t_1, t_2, t_3\}$
as squares. Places carry tokens depicted as filled circles; thus $p_1$ carries one token
and $p_2$ carries none. We write this as $[p_1 \colon 1, p_2 \colon 0]$, or $(1, 0)$ if there is a clear
ordering on the places. Transition $t_1$ can add a single token to place $p_1$ at any
moment. As soon as a token is present in $p_1$, it can be consumed by transition
$t_2$, which then adds a token to place $p_2$ and puts back one token to place $p_1$.
Finally, transition $t_3$ consumes tokens from $p_1$ without adding any token at all.
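The token game just described can be replayed mechanically. The following Python sketch encodes the net of Figure 1 as read from the description above; the names `NET`, `enabled` and `fire` are illustrative assumptions and do not come from the paper or its prototype.

```python
# Illustrative encoding of the net of Figure 1 (names are assumptions):
# each transition maps to (tokens consumed per place, tokens produced per place).
NET = {
    "t1": ({}, {"p1": 1}),                   # adds a token to p1
    "t2": ({"p1": 1}, {"p1": 1, "p2": 1}),   # takes one from p1, puts it back, adds one to p2
    "t3": ({"p1": 1}, {}),                   # consumes a token from p1
}


def enabled(marking, transition):
    """Return True if every input place holds enough tokens."""
    consumed, _ = NET[transition]
    return all(marking.get(place, 0) >= count for place, count in consumed.items())


def fire(marking, transition):
    """Return the marking obtained by firing `transition`; the input is not modified."""
    if not enabled(marking, transition):
        raise ValueError(f"{transition} is not enabled in {marking}")
    consumed, produced = NET[transition]
    result = dict(marking)
    for place, count in consumed.items():
        result[place] = result.get(place, 0) - count
    for place, count in produced.items():
        result[place] = result.get(place, 0) + count
    return result


initial = {"p1": 1, "p2": 0}      # the marking (1, 0) drawn in Figure 1
after_t2 = fire(initial, "t2")    # marking (1, 1)
after_t3 = fire(after_t2, "t3")   # marking (0, 1)
print(after_t2, after_t3)
```

Markings are plain dictionaries here for readability; a vector representation would be the more natural choice for large nets.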


Fig. 1. Left: A Petri net $\mathcal{N}$. Right: Search of the forthcoming Algorithm 1 over the
graph $G_{\mathbb{N}}(\mathcal{N})$ from $(0,0)$ to $(0,1)$, where $(x, y)$ denotes $[p_1 \colon x, p_2 \colon y]$ and each number
in a box next to a marking is its heuristic value. Only the blue region is expanded.


A Petri net induces a possibly infinite directed graph whose vertices are 
markings, and whose edges are determined by the transitions of the Petri net, 
cf. the right side of Figure 1. Given two markings, the reachability problem asks 
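whether they are connected in this graph. In Figure 1, the marking $(0,1)$ is
reachable from $(0,0)$, e.g., via paths of lengths 3 and 5: $(0,0) \xrightarrow{t_1} (1,0) \xrightarrow{t_2}
(1,1) \xrightarrow{t_3} (0,1)$ and $(0,0) \xrightarrow{t_1} (1,0) \xrightarrow{t_1} (2,0) \xrightarrow{t_2} (2,1) \xrightarrow{t_3} (1,1) \xrightarrow{t_3} (0,1)$.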

In practice, the Petri net reachability problem is a challenging decision prob- 
lem due to its horrendous worst-case complexity: an exponential-space lower 
bound was established in the 1970s [45], and a non-elementary time lower bound
has only recently been established [13]. One may thus question whether a problem
with such high worst-case complexity is of any practical relevance, and
whether reducing program analysis tasks to Petri net reachability is anything
other than an intellectual exercise. We debunk those concerns and present
a technique which decides most reachability instances appearing in the wild. 
When evaluated on large-scale instances involving Petri nets with thousands of 
places and tens of thousands of transitions, our prototype implementation is 
most of the time faster, even up to several orders of magnitude on large-scale 
instances, and solves more instances than existing state-of-the-art tools. Our im- 
plementation is also competitive with specialized domain-specific tools. One of 
the biggest advantages of our approach is that it is extremely simple to describe 
and implement, and it readily generalizes to many extensions of Petri nets. In 
fact, it was surprising to us that our approach has not yet been discovered. We 
now describe the main observations and techniques underlying our approach. 
Ever since the early days of research in Petri nets, state-space over-approxi- 
mations have been studied to attenuate the high computational complexity of 
their decision problems. One such over-approximation is, informally speaking, 
to allow places to carry a negative number of tokens. Deciding reachability then 
reduces to solving the so-called state equation, a system of linear equations
associated with a Petri net. Another over-approximation is given by continuous Petri nets,
a variant where places carry fractional tokens and “fractions of transitions” can
be applied [14]. The benefit is that deciding reachability drops down to polynomial
time [25]. While those approximations have been applied for pruning search
spaces, see e.g. [23,4,8,29], we make the following simple key observation: 


If a marking m is reachable from an initial marking in an over- 
approximation, then the length of a shortest witnessing path in the over- 
approximation lower bounds the length of a shortest path reaching m. 
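To make the observation concrete, the sketch below computes, for the net of Figure 1, the length of a shortest path from (0,0) to (0,1) both under the exact semantics and under the relaxation that lets token counts go negative. It uses plain breadth-first search rather than the solvers discussed later, and the names `TRANSITIONS` and `bfs_distance` are assumptions made for illustration only.

```python
from collections import deque

# Net of Figure 1 over places (p1, p2): transition -> (guard, effect).
TRANSITIONS = {
    "t1": ((0, 0), (1, 0)),
    "t2": ((1, 0), (0, 1)),
    "t3": ((1, 0), (-1, 0)),
}


def bfs_distance(source, target, relaxed):
    """Length of a shortest firing sequence from source to target.

    With relaxed=True guards are ignored, so token counts may become negative.
    Caution: the loop does not terminate if target is unreachable, because
    the explored graph is infinite.
    """
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        marking, depth = queue.popleft()
        if marking == target:
            return depth
        for guard, effect in TRANSITIONS.values():
            if not relaxed and any(m < g for m, g in zip(marking, guard)):
                continue  # not firable under the exact semantics
            successor = tuple(m + e for m, e in zip(marking, effect))
            if successor not in seen:
                seen.add(successor)
                queue.append((successor, depth + 1))


exact = bfs_distance((0, 0), (0, 1), relaxed=False)
lower_bound = bfs_distance((0, 0), (0, 1), relaxed=True)
print(lower_bound, exact)
```

In the relaxed net a single firing of t2 already reaches the target, whereas the exact net needs three firings, so the relaxed length is a valid but not tight lower bound.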


The availability of an oracle providing lower bounds on the length of shortest 
paths between markings enables us to appeal to classical graph traversal algo- 
rithms which have been highly successful in artificial intelligence and require such 
oracles, namely A* and greedy best-first search, see e.g. [52]. In particular, determining
the length of shortest paths in the over-approximations described above
can be phrased as optimization problems in (integer) linear programming and 
optimization modulo theories, for which efficient off-the-shelf solvers are avail- 
able [32,7]. Thus, oracle calls can be made at comparably modest computational 
cost, which is crucial for the applicability of those algorithms. As a result, a 
large class of existing state-space over-approximations can be applied to obtain 
a highly efficient forward-analysis semi-decision procedure for the reachability 
problem. For example, in Figure 1, using the state equation as distance oracle, 
A* only explores the four vertices in the blue region and directly reaches the
target vertex, whereas a breadth-first search may need to explore all vertices of
the figure and a depth-first search may not even terminate.

In theory, our approach could be turned into a decision procedure by applying
bounds on the length of shortest paths in Petri nets [44]. However, such
lengths can grow non-elementarily in the number of places [13], and just computing
the cut-off length will already be infeasible for any Petri net of practical
relevance. It is worth mentioning, though, that the over-approximations we employ
have been observed to often witness non-reachability in practice, see
e.g. [23]. Still, when dealing with finite state spaces, our procedure is complete.

A noteworthy benefit of our approach is that it enables finding shortest paths 
when A* is used as the underlying algorithm. In program analysis, paths usually 
correspond to traces reaching an erroneous configuration. In this setting, shorter 
error traces are preferred as they help to understand why a certain error occurs.
Furthermore, in program synthesis, paths correspond to synthesis plans. Again, 
shorter paths are preferred as they yield shorter synthesized programs. In fact, 
we develop our algorithmic framework for weighted Petri nets in which transi- 
tions are weighted with positive integers. Classical Petri nets correspond to the 
special instance where all weights are equal to one. Weighted Petri nets are useful 
to reflect cost or preferences in synthesis tasks. For example, there are program 
synthesis approaches where software projects are mined to determine how often 
API methods are called to guide a procedure by preferring more frequent meth- 
ods [27,26,46]. Similarity metrics can also be used to obtain costs estimating the 
relevance of invoking methods [24]. It has further been argued that weighted 
Petri nets are a good model for synthesis tasks of chemical reactions as they can 
reflect costs of various chemical compounds [58]. Finally, weights can be viewed 
as representing an amount of time it takes to fire a transition, see e.g. [50]. 


Related work. Our approach falls under the umbrella term directed model check- 
ing coined in the early 2000s, which refers to a set of techniques to tackle the 
state-explosion problem via guided state-space exploration. It primarily targets 
disproving safety properties by quickly finding a path to an error state without 
the need to explicitly construct the whole state space. As such, directed model 
checking is useful for bug-finding since, in the words of Yang and Dill [60], “in
practice, model checkers are most useful when they find bugs, not when they prove
a property”. The survey paper [20] gives an overview of various directed model
checking techniques for finite-state systems. 

For Petri nets, directed reachability algorithms based on over-approximations 
as developed in this work have not been described. In [56], it is argued that ex- 
ploration heuristics, like A*, can be useful for Petri nets, but they do not consider 
over-approximations for the underlying heuristic functions. The authors of [36] 
use Petri nets for scheduling problems and employ the state equation, viewed as 
a system of linear equations over Q, in order to explore and prune reachability 
graphs. This approach is, however, not guaranteed to discover shortest paths. 
There has been further work on using A* for exploring the reachability graph of
Petri nets for scheduling problems, see, e.g., [42,48] and the references therein. 


2 Preliminaries 


Let $\mathbb{N} := \{0, 1, \ldots\}$. For all $D \subseteq \mathbb{Q}$ and ${\triangleright} \in \{\geq, >\}$, let $D_{\triangleright 0} := \{a \in D : a \triangleright 0\}$,
and for every set $X$, let $D^X$ denote the set of vectors $D^X := \{v \mid v \colon X \to D\}$.
We naturally extend operations componentwise. In particular, $(u + v)(x) :=
u(x) + v(x)$ for every $x \in X$, and $u \geq v$ iff $u(x) \geq v(x)$ for every $x \in X$.


Graphs. A (labeled directed) graph is a triple $G = (V, E, A)$, where $V$ is a set of
nodes, $A$ is a finite set of elements called actions, and $E \subseteq V \times A \times V$ is the
set of edges labeled by actions. We say that $G$ has finite out-degree if the set of
outgoing edges $\{(w, a, w') \in E : w = v\}$ is finite for every $v \in V$. Similarly, it has
finite in-degree if the set of ingoing edges is finite for every $v \in V$. If $G$ has both
finite out- and in-degree, then we say that $G$ is locally finite. A path $\pi$ is a finite
sequence of nodes $(v_i)_{1 \leq i \leq n}$ and actions $(a_i)_{1 \leq i < n}$ such that $(v_i, a_i, v_{i+1}) \in E$
for all $1 \leq i < n$. We say that $\pi$ is a path from $v$ to $w$ (or a $v$-$w$ path) if $v = v_1$
and $w = v_n$, and its label is $a_1 a_2 \cdots a_{n-1}$, where $\varepsilon$ denotes the empty sequence.

A weighted graph is a tuple $G = (V, E, A, \mu)$ where $(V, E, A)$ is a graph
with a weight function $\mu \colon E \to \mathbb{Q}_{>0}$. The weight of path $\pi$ is the weight of its
edges, i.e. $\mu(\pi) := \sum_{1 \leq i < n} \mu(v_i, a_i, v_{i+1})$. A shortest path from $v$ to $w$ is a $v$-$w$
path $\pi$ minimizing $\mu(\pi)$. We define $\mathrm{dist}_G \colon V \times V \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ as the distance
function where $\mathrm{dist}_G(v, w)$ is the weight of a shortest path from $v$ to $w$, with
$\mathrm{dist}_G(v, w) := \infty$ if there is none. We assume throughout the paper that weighted
graphs have a minimal weight, i.e. that $\min\{\mu(e) : e \in E\}$ exists. For graphs
with finite out-degree, this ensures that if a path exists between two nodes, then
a shortest one exists.* This mild assumption always holds in our setting.


Petri nets. A weighted Petri net is a tuple $\mathcal{N} = (P, T, f, \lambda)$ where


— $P$ is a finite set whose elements are called places,

— $T$ is a finite set, disjoint from $P$, whose elements are called transitions,

— $f \colon (P \times T) \cup (T \times P) \to \mathbb{N}$ is the flow function assigning multiplicities to
arcs connecting places and transitions, and

— $\lambda \colon T \to \mathbb{Q}_{>0}$ is the weight function assigning weights to transitions.


A marking is a vector $m \in \mathbb{N}^P$ which indicates that place $p$ holds $m(p)$ tokens. A
weighted Petri net with $\lambda(t) = 1$ for each $t \in T$ is called a Petri net. For example,
Figure 1 depicts a Petri net $\mathcal{N}$ with $P = \{p_1, p_2\}$, $T = \{t_1, t_2, t_3\}$, $f(p_1, t_3) =
f(p_1, t_2) = f(t_1, p_1) = f(t_2, p_1) = f(t_2, p_2) = 1$ (multiplicity omitted on arcs)
and $f(-, -) = 0$ elsewhere (no arc). Moreover, $\mathcal{N}$ is marked with $[p_1 \colon 1, p_2 \colon 0]$.

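The guard and effect of a transition $t \in T$ are vectors $g_t \in \mathbb{N}^P$ and $\Delta_t \in \mathbb{Z}^P$
where $g_t(p) := f(p, t)$ and $\Delta_t(p) := f(t, p) - f(p, t)$. We say that $t$ is firable
from marking $m$ if $m \geq g_t$. If $t$ is firable from $m$, then it may be fired, which
leads to marking $m' := m + \Delta_t$. We write this as $m \xrightarrow{t}_{\mathbb{N}} m'$. These notions
naturally extend to sequences of transitions, i.e. $\xrightarrow{\varepsilon}_{\mathbb{N}}$ denotes the identity relation
over $\mathbb{N}^P$, $\Delta_\varepsilon := 0$, $\lambda(\varepsilon) := 0$, and for every $t_1, t_2, \ldots, t_k \in T$: $\Delta_{t_1 t_2 \cdots t_k} :=
\Delta_{t_1} + \Delta_{t_2} + \cdots + \Delta_{t_k}$, $\lambda(t_1 t_2 \cdots t_k) := \lambda(t_1) + \lambda(t_2) + \cdots + \lambda(t_k)$, and

$$\xrightarrow{t_1 t_2 \cdots t_k}_{\mathbb{N}} \ :=\ \xrightarrow{t_k}_{\mathbb{N}} \circ \cdots \circ \xrightarrow{t_2}_{\mathbb{N}} \circ \xrightarrow{t_1}_{\mathbb{N}}.$$


* Otherwise, there could be increasingly better paths, e.g. of weights 1, 1/2, 1/4, ....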


8 M. Blondin et al. 


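We say that ${\to_{\mathbb{N}}} := \bigcup_{t \in T} \xrightarrow{t}_{\mathbb{N}}$ and ${\xrightarrow{*}_{\mathbb{N}}} := \bigcup_{\sigma \in T^*} \xrightarrow{\sigma}_{\mathbb{N}}$ are the step and reachability
relations. Note that the latter is the reflexive transitive closure of $\to_{\mathbb{N}}$.

For example, $m \xrightarrow{t_2 t_3}_{\mathbb{N}} m'$ and $m \xrightarrow{*}_{\mathbb{N}} m'$ in Figure 1, where $m :=
[p_1 \colon 1, p_2 \colon 0]$ and $m' := [p_1 \colon 0, p_2 \colon 1]$. Moreover, $t_2$ is not firable in $m'$.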

Given a sequence $\sigma \in T^*$, denote by $|\sigma|_t \in \mathbb{N}$ the number of times transition
$t$ occurs in $\sigma$. The Parikh image of $\sigma$ is the vector $\boldsymbol{\sigma} \in \mathbb{N}^T$ that captures the
number of occurrences of transitions appearing in $\sigma$, i.e. $\boldsymbol{\sigma}(t) := |\sigma|_t$ for all $t \in T$.
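As a small illustration of these definitions, the sketch below derives guards and effects from the flow function of the net in Figure 1 and computes the Parikh image of a firing sequence. Since the effect of a sequence is the sum of the effects of its transitions, it depends only on the Parikh image. The names `FLOW`, `guard`, `effect`, `parikh` and `effect_of` are assumptions chosen for this example.

```python
from collections import Counter

PLACES = ("p1", "p2")
# Non-zero entries of the flow function f of the net in Figure 1.
FLOW = {
    ("p1", "t3"): 1, ("p1", "t2"): 1,
    ("t1", "p1"): 1, ("t2", "p1"): 1, ("t2", "p2"): 1,
}


def f(x, y):
    """Flow function: multiplicity of the arc from x to y, 0 if there is no arc."""
    return FLOW.get((x, y), 0)


def guard(t):
    """Guard g_t(p) = f(p, t), as a vector over PLACES."""
    return tuple(f(p, t) for p in PLACES)


def effect(t):
    """Effect Delta_t(p) = f(t, p) - f(p, t), as a vector over PLACES."""
    return tuple(f(t, p) - f(p, t) for p in PLACES)


def parikh(sequence):
    """Parikh image: number of occurrences of each transition."""
    return Counter(sequence)


def effect_of(sequence):
    """Effect of a whole sequence, computed from its Parikh image only."""
    counts = parikh(sequence)
    return tuple(
        sum(counts[t] * effect(t)[i] for t in counts) for i in range(len(PLACES))
    )


sigma = ["t1", "t1", "t2", "t3", "t3"]
print(guard("t2"), effect("t2"), dict(parikh(sigma)), effect_of(sigma))
```

Reordering `sigma` leaves `effect_of(sigma)` unchanged, although the reordered sequence need not be firable.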

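Each weighted Petri net $\mathcal{N} = (P, T, f, \lambda)$ induces a locally finite weighted
graph $G_{\mathbb{N}}(\mathcal{N}) := (V, E, T, \mu)$, called its reachability graph, where $V := \mathbb{N}^P$, $E :=
\{(m, t, m') : m \xrightarrow{t}_{\mathbb{N}} m'\}$ and $\mu(m, t, m') := \lambda(t)$ for each $(m, t, m') \in E$. An
example of a reachability graph is given on the right of Figure 1. We write $\mathrm{dist}_{\mathcal{N}}$
to denote $\mathrm{dist}_{G_{\mathbb{N}}(\mathcal{N})}$. We have $\mathrm{dist}_{\mathcal{N}}(m, m') \neq \infty$ iff $m \xrightarrow{\sigma}_{\mathbb{N}} m'$ for some $\sigma \in T^*$,
and if the latter holds, then $\mathrm{dist}_{\mathcal{N}}(m, m')$ is the minimal weight among such
firing sequences $\sigma$. Moreover, for (unweighted) Petri nets, $\mathrm{dist}_{\mathcal{N}}(m, m')$ is the
minimal number of transitions to fire to reach $m'$ from $m$.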


3 Directed Search Algorithms 


Our approach relies on classical pathfinding procedures guided by node selection 
strategies. Their generic scheme is described in Algorithm 1. Its termination with
a value $d \neq \infty$ indicates that the weighted graph $G = (V, E, A, \mu)$ has a path from
$s$ to $t$ of weight $d$, whereas termination with $d = \infty$ signals that $\mathrm{dist}_G(s, t) = \infty$.

Algorithm 1 maintains a set of frontier nodes $C$ and a mapping $g \colon V \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ such
that $g(w)$ is the weight of the best known path from $s$ to $w$. In Line 4, a selection strategy
$S$ determines which node $v$ to expand next. Starting from Line 6, a successor $w$ of $v$ is
added to the frontier if its distance improves.

 1  $g := [s \mapsto 0,\ v \mapsto \infty : v \neq s]$
 2  $C := \{s\}$
 3  while $C \neq \emptyset$ do
 4      $v := \arg\min_{v \in C} S(g, v)$
 5      if $v = t$ then return $g(t)$
 6      for $(v, a, w) \in E$ do
 7          if $g(v) + \mu(v, a, w) < g(w)$ then
 8              $g(w) := g(v) + \mu(v, a, w)$
 9              $C := C \cup \{w\}$
10      $C := C \setminus \{v\}$
11  return $\infty$

Algorithm 1: Directed search algorithm.

Let $h \colon V \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ estimate the distance from all nodes to a target $t \in V$. The
selection strategies sending $(g, v)$ respectively to $g(v)$, $g(v) + h(v)$ or $h(v)$ yield
the classical Dijkstra’s, A* and greedy best-first search (GBFS) algorithms.
When instantiating $S$ with Dijkstra’s selection strategy, a return value $d \neq \infty$
is guaranteed to equal $\mathrm{dist}_G(s, t)$. This is not true for A* and GBFS. However,
if $h$ fulfills the following consistency properties, then A* also has this guarantee:
$h(t) = 0$ and $h(v) \leq \mu(v, a, w) + h(w)$ for every $(v, a, w) \in E$ (see, e.g., [52]).
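The generic scheme can be sketched in a few lines of Python. This is a direct, unoptimized transcription under stated assumptions: the graph is given by a hypothetical `successors` callback returning `(node, weight)` pairs, actions are omitted, and the frontier is a plain set scanned with `min` rather than a priority queue. The function and strategy names are illustrative, not taken from the paper's prototype.

```python
import math


def directed_search(source, target, successors, select, h=lambda v: 0):
    """Transcription of the generic directed search scheme.

    `select(g_value, h_value)` is the selection strategy; the node with the
    smallest value is expanded next. On an infinite graph the loop may not
    terminate if the target is unreachable.
    """
    g = {source: 0}            # best known path weights; missing means infinity
    frontier = {source}
    while frontier:
        v = min(frontier, key=lambda u: select(g[u], h(u)))
        if v == target:
            return g[target]
        for w, weight in successors(v):
            if g[v] + weight < g.get(w, math.inf):
                g[w] = g[v] + weight
                frontier.add(w)
        frontier.discard(v)
    return math.inf


def dijkstra(g_value, h_value):
    return g_value


def a_star(g_value, h_value):
    return g_value + h_value


def gbfs(g_value, h_value):
    return h_value


# Toy graph: the cheap route goes through "a", but the heuristic favors "b".
GRAPH = {"s": [("a", 1), ("b", 1)], "a": [("t", 1)], "b": [("t", 5)], "t": []}
H = {"s": 1, "a": 1, "b": 0, "t": 0}   # consistent on this graph


def successors(v):
    return GRAPH.get(v, [])


print(directed_search("s", "t", successors, dijkstra))
print(directed_search("s", "t", successors, a_star, H.get))
print(directed_search("s", "t", successors, gbfs, H.get))
```

On this toy graph Dijkstra's and A* strategies return the shortest weight, while the greedy strategy commits to the node with the smaller heuristic value and returns a longer path.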
In the setting of infinite graphs, unlike GBFS, A* and Dijkstra’s selection
strategies guarantee termination if $\mathrm{dist}_G(s, t) \neq \infty$. Yet, we introduce unbounded
heuristics for which termination is also guaranteed for GBFS. Note that these
guarantees would vanish in the presence of zero weights. An infinite path $\pi$ is
a sequence of nodes $(v_i)_{i \in \mathbb{N}}$ and actions $(a_i)_{i \in \mathbb{N}}$ such that $(v_i, a_i, v_{i+1}) \in E$ for
all $i \in \mathbb{N}$. We say that heuristic $h$ is unbounded (w.r.t. $G$) if for every infinite
simple path $v_0, v_1, v_2, \ldots$ of $G$ and for every $b \in \mathbb{Q}_{\geq 0}$, there exists an index $i$ s.t.
$h(v_i) \geq b$. In other words, unboundedness forbids an infinite simple path of $G$
to “cap” at some distance estimate $b$. The following technical lemma enables us to
prove termination of GBFS in the presence of unbounded heuristics.


Lemma 1. If G is locally finite, then the following holds: 


1. The set of paths of weight at most $c \in \mathbb{Q}_{\geq 0}$ starting from node $s$ is finite.
2. Let $W \subseteq V$. The set $\mathrm{dist}_G(W, t) := \{\mathrm{dist}_G(w, t) : w \in W\}$ has a minimum.
3. No node is expanded infinitely often by Algorithm 1.


Theorem 1. Algorithm 1 with the greedy best-first search selection strategy al- 
ways finds reachable targets for locally finite graphs and unbounded heuristics. 


Proof. First observe that Algorithm 1 satisfies this invariant: 


if $g(v) \neq \infty$, then $g(v)$ is the weight of a path from $s$ to $v$ in $G$
whose nodes were all expanded, except possibly $v$. (*)


Assume $\mathrm{dist}_G(s, t) \neq \infty$. For the sake of contradiction, suppose $t$ is never
expanded. Let $K_i$ be the subgraph of $G$ induced by nodes expanded at least
once within the first $i$ iterations of the while loop. In particular, $K_1$ is the
graph made only of node $s$. Let $K := K_1 \cup K_2 \cup \cdots$. By Lemma 1 (3), no node is
expanded infinitely often, hence $K$ is infinite. Moreover, $K$ has finite out-degree,
and each node of $K$ is reachable from $s$ in $K$ by (*). Thus, by König’s lemma,
$K$ contains an infinite path $v_0, v_1, \ldots \in V$ of pairwise distinct nodes.

Let $w$ be a node of $K$ minimizing $\mathrm{dist}_G(w, t)$. That minimum is well-defined
by Lemma 1 (2). Since $s \in K_1 \subseteq K$ and $t$ is reachable from $s$, we have
$\mathrm{dist}_G(w, t) \leq \mathrm{dist}_G(s, t) < \infty$. By minimality of $w \neq t$, there exists an edge
$(w, a, w')$ of $G$ such that $\mathrm{dist}_G(w', t) < \mathrm{dist}_G(w, t)$ and $w'$ does not appear in $K$.
Note that $w'$ is added to $C$ at some point, but is never expanded as it would
otherwise belong to $K$. Let $i$ be the smallest index such that $w$ belongs to $K_i$.
Since $h$ is unbounded, there exists $j$ such that $h(v_j) > h(w')$ and $v_j$ is expanded
after iteration $i$ of the while loop. This is a contradiction as $w'$ would have been
expanded instead of $v_j$.


4 Directed Reachability 


In this section, we explain how to instantiate Algorithm 1 for finding short(est) 
firing sequences witnessing reachability in weighted Petri nets. Since Dijkstra’s 
selection strategy does not require any heuristic, we focus on A* and greedy best- 
first search which require consistent and unbounded heuristics. More precisely, 
we introduce distance under-approximations (Section 4.1); present relevant con- 
crete distance under-approximations (Section 4.2); and put everything together 
into our framework (Section 4.3). 




4.1 Distance Under-approximations 


A distance under-approximation of a weighted Petri net $\mathcal{N} = (P, T, f, \lambda)$ is a
function $d \colon \mathbb{N}^P \times \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ such that for all $m, m', m'' \in \mathbb{N}^P$:

— $d(m, m') \leq \mathrm{dist}_{\mathcal{N}}(m, m')$,

— $d(m, m'') \leq d(m, m') + d(m', m'')$ (triangle inequality), and

— $d$ is effective, i.e. there is an algorithm that evaluates $d$ on all inputs.


We naturally obtain a heuristic from $d$ for a directed search towards marking
$m_{\text{target}}$. Indeed, let $h \colon \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ be defined by $h(m) := d(m, m_{\text{target}})$.
The following proposition shows that $h$ is a suitable heuristic for A*:


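Proposition 1. Mapping $h$ is a consistent heuristic.

Proof. Let $m, m' \in \mathbb{N}^P$ and $t \in T$ be such that $m \xrightarrow{t}_{\mathbb{N}} m'$. We have:

\begin{align*}
h(m) &= d(m, m_{\text{target}}) && \text{(by def. of $h$)} \\
     &\leq d(m, m') + d(m', m_{\text{target}}) && \text{(by the triangle inequality)} \\
     &\leq \mathrm{dist}_{\mathcal{N}}(m, m') + d(m', m_{\text{target}}) && \text{(by distance under-approximation)} \\
     &\leq \lambda(t) + d(m', m_{\text{target}}) && \text{(since $m \xrightarrow{t}_{\mathbb{N}} m'$)} \\
     &= \lambda(t) + h(m') && \text{(by def. of $h$).}
\end{align*}

Moreover, $h(m_{\text{target}}) = d(m_{\text{target}}, m_{\text{target}}) \leq \mathrm{dist}_{\mathcal{N}}(m_{\text{target}}, m_{\text{target}}) = 0$, where
the last equality follows from the fact that weights are positive.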
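Consistency can also be spot-checked numerically on a finite window of markings. The sketch below does so for the net of Figure 1 with target (0,1) and unit weights. The closed form used for `h` was derived by hand from the state equation relaxation for this particular net and is an assumption of the example, as are all names; a finite check of this kind illustrates the property but does not prove it.

```python
import math

TARGET = (0, 1)
# Net of Figure 1 over places (p1, p2): transition -> (guard, effect); weights are 1.
TRANSITIONS = {
    "t1": ((0, 0), (1, 0)),
    "t2": ((1, 0), (0, 1)),
    "t3": ((1, 0), (-1, 0)),
}


def h(marking):
    """Hand-derived heuristic towards TARGET (assumption of this sketch).

    Allowing negative token counts, p1 must lose x tokens via t3 and p2 must
    gain 1 - y tokens via t2; no transition removes tokens from p2.
    """
    x, y = marking
    return x + (1 - y) if y <= 1 else math.inf


def steps(marking):
    """Yield (transition, successor) for every transition firable in marking."""
    for name, (guard, effect) in TRANSITIONS.items():
        if all(m >= g for m, g in zip(marking, guard)):
            yield name, tuple(m + e for m, e in zip(marking, effect))


def is_consistent_on(window):
    """Check h(target) = 0 and h(m) <= 1 + h(m') for all steps leaving the window."""
    if h(TARGET) != 0:
        return False
    return all(h(m) <= 1 + h(succ) for m in window for _, succ in steps(m))


window = [(x, y) for x in range(6) for y in range(6)]
print(is_consistent_on(window))
```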


4.2 From Petri Net Relaxations to Distance Under-approximations 


We now introduce classical relaxations of Petri nets which over-approximate 
reachability and consequently give rise to distance under-approximations. The 
main source of hardness of the reachability problem stems from the fact that 
places are required to hold a non-negative number of tokens. If we relax this requirement
and allow negative numbers of tokens, we obtain a more tractable relation.
More precisely, we write $m \xrightarrow{t}_{\mathbb{Z}} m'$ iff $m' = m + \Delta_t$. Note that transitions
are always firable under this semantics. Moreover, they may lead to “markings” 
with negative components. 

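Another source of hardness comes from the fact that markings are discrete.
Hence, we can further relax $\to_{\mathbb{Z}}$ into $\to_{\mathbb{Q}}$ where transitions may be scaled down:

$$m \xrightarrow{\delta t}_{\mathbb{Q}} m' \iff m' = m + \delta \cdot \Delta_t \text{ for some } 0 < \delta \leq 1.$$

One gets a less crude relaxation from considering nonnegative “markings” only:

$$m \xrightarrow{\delta t}_{\mathbb{Q}_{\geq 0}} m' \iff (m \geq \delta \cdot {}^{\bullet}t) \text{ and } (m' = m + \delta \cdot \Delta_t) \text{ for some } 0 < \delta \leq 1.$$

Under these, we obtain “markings” from $\mathbb{Q}^P$ and $\mathbb{Q}^P_{\geq 0}$, respectively. Petri nets
equipped with relation $\to_{\mathbb{Q}_{\geq 0}}$ are known as continuous Petri nets [14,15].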
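The difference between the standard step relation and its relaxations can be sketched in a few lines of Python. The transition below and the encoding of a transition as a (guard, effect) pair of tuples are assumptions made for illustration only; they are not taken from the paper.

```python
from fractions import Fraction

# Hypothetical transition: consumes one token from p1, produces one in p2.
GUARD = (1, 0)
EFFECT = (-1, 1)

def step_nat(m, guard, effect):
    """Standard semantics: fire only if the guard is covered by m."""
    if all(a >= g for a, g in zip(m, guard)):
        return tuple(a + e for a, e in zip(m, effect))
    return None

def step_int(m, effect):
    """Z-relaxation: always firable, components may become negative."""
    return tuple(a + e for a, e in zip(m, effect))

def step_rat(m, effect, delta):
    """Q-relaxation: like the Z-relaxation, scaled by some 0 < delta <= 1."""
    assert 0 < delta <= 1
    return tuple(a + delta * e for a, e in zip(m, effect))

def step_cont(m, guard, effect, delta):
    """Continuous semantics: scaled firing, the scaled guard must be covered."""
    assert 0 < delta <= 1
    if all(a >= delta * g for a, g in zip(m, guard)):
        return tuple(a + delta * e for a, e in zip(m, effect))
    return None
```

From the empty marking, the transition is disabled under the standard and continuous semantics, while the Z- and Q-relaxations fire it and yield a negative component.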
To unify all three relaxations, we sometimes write $m \xrightarrow{\delta t}_{G} m'$ to emphasize
the scaling factor $\delta$, where $\delta = 1$ whenever $G = \mathbb{Z}$. Let $d_G \colon \mathbb{N}^P \times \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$
be defined as $d_G(m, m') := \infty$ if $m \not\xrightarrow{*}_{G} m'$, and otherwise:

$$d_G(m, m') := \min\Big\{ \sum_{i=1}^{n} \delta_i \cdot \lambda(t_i) : m \xrightarrow{\delta_1 t_1 \cdots \delta_n t_n}_{G} m' \Big\}.$$

In words, $d_G(m, m')$ is the weight of a shortest path from $m$ to $m'$ in the graph
induced by the relaxed step relation $\to_G$, where weights are scaled accordingly.
We now show that any $d_G$, which we call the $G$-distance, is a distance
under-approximation, and first show effectiveness of all $d_G$. It is well-known and readily
seen that reachability over $G \in \{\mathbb{Z}, \mathbb{Q}\}$ is characterized by the following state
equation, since transitions are always firable due to the absence of guards:

$$m \xrightarrow{*}_{G} m' \iff \exists \boldsymbol{\sigma} \in G^T_{\geq 0} : m' = m + \sum_{t \in T} \boldsymbol{\sigma}(t) \cdot \Delta_t.$$

Here, $\boldsymbol{\sigma}$ can be seen as the Parikh image of a sequence $\sigma$ leading from $m$ to $m'$.

Proposition 2. The functions $d_{\mathbb{Z}}$, $d_{\mathbb{Q}}$, $d_{\mathbb{Q}_{\geq 0}}$ are effective.


Proof. By the state equation, we have:

$$d_G(m, m') = \min\Big\{ \sum_{t \in T} \lambda(t) \cdot \boldsymbol{\sigma}(t) : \boldsymbol{\sigma} \in G^T_{\geq 0},\ m' = m + \sum_{t \in T} \boldsymbol{\sigma}(t) \cdot \Delta_t \Big\}.$$

Therefore, $d_{\mathbb{Q}}(m, m')$ (resp. $d_{\mathbb{Z}}(m, m')$) is computable by (resp. integer) linear
programming, which is complete for P (resp. NP), in its variant where one must
check whether the minimal solution is at most some bound.

For $d_{\mathbb{Q}_{\geq 0}}$, note that the reachability relation of a continuous Petri net can
be expressed in the existential fragment of linear real arithmetic [8]. Hence,
effectiveness follows from the decidability of linear real arithmetic.
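To make the optimization problem behind the state equation concrete, here is a minimal Python sketch for the integer case. It is not an (integer) linear programming solver: it enumerates Parikh vectors with entries up to a fixed bound, which is only workable for toy nets. The function name, the `bound` parameter and the example net are assumptions introduced for illustration.

```python
from itertools import product

def z_distance(m, m_target, effects, weights, bound=4):
    """Cheapest Parikh vector over naturals solving the state equation.

    Brute force over vectors with entries in 0..bound. Returns None if no
    solution exists within the bound (which does not prove there is none).
    """
    best = None
    for sigma in product(range(bound + 1), repeat=len(effects)):
        reached = tuple(
            m[p] + sum(s * eff[p] for s, eff in zip(sigma, effects))
            for p in range(len(m))
        )
        if reached == tuple(m_target):
            cost = sum(s * w for s, w in zip(sigma, weights))
            if best is None or cost < best:
                best = cost
    return best
```

For instance, with hypothetical effects `[(1, 0), (0, 1), (-1, 0)]` and weights `[1, 2, 1]`, moving from `(1, 0)` to `(0, 1)` requires firing the second and third transitions once each, for a total weight of 3. A real implementation would hand the same constraints to an LP or MILP solver.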


Altogether, we conclude that $d_G$ is a distance under-approximation. Furthermore,
we can show that $d_G$ yields unbounded heuristics, which, by Theorem 1,
ensure termination of GBFS on reachable instances:


Theorem 2. Let $G \in \{\mathbb{Z}, \mathbb{Q}, \mathbb{Q}_{\geq 0}\}$. Then $d_G$ is a distance under-approximation.
Moreover, the heuristics arising from it are unbounded.


Proof. Let $\mathcal{N} = (P, T, f, \lambda)$ be a weighted Petri net. Effectiveness of $d_G$ follows
from Proposition 2. By definitions and a simple induction, $\xrightarrow{\sigma}_{\mathbb{N}} \subseteq \xrightarrow{\sigma}_{G}$ for any
sequence $\sigma \in T^*$, with weights left unchanged for unscaled transitions. This
implies that $d_G(m, m') \leq \mathrm{dist}_{\mathcal{N}}(m, m')$ for every $m, m' \in G^P$. Moreover, the
triangle inequality holds since for every $m, m', m'' \in G^P$ and sequences $\sigma, \sigma'$:

$$m \xrightarrow{\sigma}_{G} m' \xrightarrow{\sigma'}_{G} m'' \text{ implies } m \xrightarrow{\sigma\sigma'}_{G} m''.$$

Let us sketch the proof of the second part. Let $m_{\mathrm{target}}$ be a marking and let
$h_G$ be the heuristic obtained from $d_G$ for $m_{\mathrm{target}}$. Since $h_{\mathbb{Q}}(m) \leq h_G(m)$ for all
$m$ and $G \in \{\mathbb{Z}, \mathbb{Q}_{\geq 0}\}$, it suffices to prove that $h_{\mathbb{Q}}$ is unbounded. Suppose it is
not. There exist $b \in \mathbb{Q}_{\geq 0}$ and pairwise distinct markings $m_0, m_1, \ldots$ each with
$h_{\mathbb{Q}}(m_i) \leq b$. Let $x_i$ be a solution to the state equation that gives $h_{\mathbb{Q}}(m_i)$. By
well-quasi-ordering and pairwise distinctness, there is a subsequence such that
$m_{i_0}(p) < m_{i_1}(p) < \cdots$ for some $p \in P$. Thus, $\lim_{j \to \infty} m_{\mathrm{target}}(p) - m_{i_j}(p) = -\infty$,
and hence $\lim_{j \to \infty} x_{i_j}(s) = \infty$ for some $s \in T$ with $\Delta_s(p) < 0$. This means
that $b \geq h_{\mathbb{Q}}(m_{i_j}) = \sum_{t \in T} \lambda(t) \cdot x_{i_j}(t) > b$ for a sufficiently large $j$.


4.3 Directed Reachability Based on Distance Under-approximations 


We have all the ingredients to use Algorithm 1 for answering reachability queries. 

A distance under-approximation scheme is a mapping $D$ that associates a distance
under-approximation $D(\mathcal{N})$ to each weighted Petri net $\mathcal{N}$. Let $h_{D(\mathcal{N}), m_{\mathrm{target}}}$
be the heuristic obtained from $D(\mathcal{N})$ for marking $m_{\mathrm{target}}$. By instantiating
Algorithm 1 with this heuristic, we can search for a short(est) firing sequence
witnessing that $m_{\mathrm{target}}$ is reachable. Of course, constructing the reachability graph
of $\mathcal{N}$ would be at least as difficult as answering this query, or impossible if it is
infinite. Hence, we provide $G_{\mathbb{N}}(\mathcal{N})$ symbolically through $\mathcal{N}$ and let Algorithm 1
explore it on-the-fly by progressively firing its transitions.

For each $G \in \{\mathbb{Z}, \mathbb{Q}, \mathbb{Q}_{\geq 0}\}$, the function $D_G$ mapping a weighted Petri net $\mathcal{N}$
to its $G$-distance $d_G$ is a distance under-approximation scheme with consistent
and unbounded heuristics by Proposition 1, Theorem 1 and Theorem 2. Although
Algorithm 1 is geared towards finding paths, it can prove non-reachability even
for infinite reachability graphs. Indeed, at some point, every candidate marking
$m \in C$ may be such that $h_{D(\mathcal{N}), m_{\mathrm{target}}}(m) = \infty$, in which case the algorithm halts
and returns $\infty$. There is no guarantee that this happens, but, as reported e.g. by [23,8],
the $G$-distance for domains $G \in \{\mathbb{Z}, \mathbb{Q}, \mathbb{Q}_{\geq 0}\}$ does well for witnessing
non-reachability in practice, often from the very first marking $m_{\mathrm{init}}$.


An example. We illustrate our approach with a toy example and $D_{\mathbb{Q}}$ (the scheme
based on the state equation over $\mathbb{Q}^T_{\geq 0}$). Consider the Petri net $\mathcal{N}$ illustrated on
the left of Figure 1, but marked with $m_{\mathrm{init}} := [p_1 : 0, p_2 : 0]$. Suppose we wish to
determine whether $m_{\mathrm{init}}$ can reach marking $m_{\mathrm{target}} := [p_1 : 0, p_2 : 1]$ in $\mathcal{N}$.

We consider the case where Algorithm 1 follows a greedy best-first search,
but the markings would be expanded in the same way with A*. Let us abbreviate
a marking $[p_1 : x, p_2 : y]$ as $(x, y)$. Since $\Delta_{t_2} = (0, 1)$, the heuristic considers that
$m_{\mathrm{init}}$ can reach $m_{\mathrm{target}}$ in a single step using transition $t_2$ (it is unaware of the
guard). Marking $(1, 0)$ is expanded and its heuristic value increases to 2 as the
state equation considers that both $t_2$ and $t_3$ must be fired (in some unknown
order). Markings $(2, 0)$ and $(1, 1)$ are both discovered with respective heuristic
values 3 and 1. The latter is more promising, so it is expanded and target $(0, 1)$
is discovered. Since its heuristic value is 0, it is immediately expanded and the
correct distance $\mathrm{dist}_{\mathcal{N}}(m_{\mathrm{init}}, m_{\mathrm{target}}) = 3$ is returned. Note that, in this example,
the only markings expanded are precisely those occurring on the shortest path.
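The walk-through above can be replayed with a small Python sketch of a directed search, here using A*. The net below is hypothetical: it was chosen so that the heuristic values match the narrative (1, 2, 3 and 1) and is not necessarily the net of Figure 1. The heuristic solves the state equation over the naturals by bounded brute force rather than by linear programming over the rationals, so it may wrongly report infinity for markings that need more than `bound` firings of one transition.

```python
import heapq
from itertools import product

# Hypothetical net over places (p1, p2); each transition is (guard, effect)
# and every transition has weight 1.
TRANSITIONS = [
    ((0, 0), (1, 0)),   # t1: no guard, adds a token to p1
    ((1, 0), (0, 1)),   # t2: needs a token in p1, adds a token to p2
    ((1, 0), (-1, 0)),  # t3: removes a token from p1
]
INF = float("inf")

def heuristic(m, target, bound=4):
    """State-equation estimate (guards ignored), by bounded brute force."""
    best = INF
    for sigma in product(range(bound + 1), repeat=len(TRANSITIONS)):
        reached = tuple(
            m[p] + sum(s * eff[p] for s, (_, eff) in zip(sigma, TRANSITIONS))
            for p in range(len(m))
        )
        if reached == target:
            best = min(best, sum(sigma))
    return best

def a_star(init, target):
    """Return (distance, expanded markings), or (None, ...) if nothing is left."""
    dist = {init: 0}
    queue = [(heuristic(init, target), init)]
    expanded = []
    while queue:
        _, m = heapq.heappop(queue)
        if m in expanded:
            continue
        expanded.append(m)
        if m == target:
            return dist[m], expanded
        for guard, effect in TRANSITIONS:
            if all(a >= g for a, g in zip(m, guard)):
                nxt = tuple(a + e for a, e in zip(m, effect))
                h = heuristic(nxt, target)
                # Markings with infinite estimate are never enqueued.
                if h != INF and dist[m] + 1 < dist.get(nxt, INF):
                    dist[nxt] = dist[m] + 1
                    heapq.heappush(queue, (dist[nxt] + h, nxt))
    return None, expanded
```

Calling `a_star((0, 0), (0, 1))` on this net expands `(0, 0)`, `(1, 0)`, `(1, 1)` and `(0, 1)` in that order and reports distance 3, mirroring the description above.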
Handling multiple targets. Algorithm 1 can be adapted to search for some marking
from a given target set $X \subseteq \mathbb{N}^P$. The idea consists simply in using a heuristic
$h_X \colon \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ estimating the weight of a shortest path to any target:

$$h_X(m) = \min\{ h_{D(\mathcal{N}), m_{\mathrm{target}}}(m) : m_{\mathrm{target}} \in X \}.$$

This is convenient for partial reachability instances occurring in practice, i.e.

$$X = \{ m_{\mathrm{target}} \in \mathbb{N}^P : m_{\mathrm{target}}(p) \sim_p c(p) \}$$

where $c \in \mathbb{N}^P$ and each $\sim_p \in \{=, \geq\}$.


5 Experimental Results 


We implemented Algorithm 1 in a prototype called FASTFORWARD [10], which
supports all presented selection strategies and distance under-approximations. 
We evaluate FASTFORWARD empirically with three main goals in mind. First, 
we show that our approach is competitive with established tools and can even 
vastly outperform them, and we also give insights on its performance w.r.t. its 
parameterizations. Second, we compare the length of the witnesses reported by 
the different tools. Third, we briefly discuss the quality of the heuristics. 


Technical details. Our tool is written in C# and uses GUROBI [32], a state-of- 
the-art MILP solver, for distance under-approximations. Benchmarks were run 
on a machine with an 8-Core Intel® Core™ i7-7700 CPU @ 3.60GHz running
Ubuntu 18.04 and with memory constrained to ~8GB. We used a timeout of 60 
seconds per instance, and all tools were invoked from a PYTHON script using the 
time module for time measurements. 

A minor challenge arises from the fact that many instances specify an upward-closed
set of initial markings rather than a single one. For example, a constraint
$m_{\mathrm{init}}(p) \geq 1$ may specify an arbitrary number of threads. We handle this by setting
$m_{\mathrm{init}}(p) = 1$ and adding a transition $t_p$ producing a token into $p$.

As a preprocessing step, we implemented sign analysis [29]. It is a general 
pruning technique running in polynomial time that has been shown beneficial 
for reducing the size of the state-space of Petri nets. Initially, places that carry 
tokens are viewed as marked. For each transition whose input places are marked, 
the output places also become marked. When a fixpoint is reached, places left 
unmarked cannot carry tokens in any reachable marking, so they are discarded. 
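The fixpoint computation just described can be sketched as follows in Python. The encoding of the net (a dictionary for the initial marking and pairs of place-name sets for transitions) is an assumption of this sketch, not the format used by the tool.

```python
def sign_analysis(initial, transitions):
    """Return the set of places that may carry a token in some reachable marking.

    `initial` maps place names to token counts; `transitions` is a list of
    (inputs, outputs) pairs, each a set of place names.
    """
    marked = {p for p, tokens in initial.items() if tokens > 0}
    changed = True
    while changed:
        changed = False
        for inputs, outputs in transitions:
            # A transition can only ever fire if all its input places are marked.
            if inputs <= marked and not outputs <= marked:
                marked |= outputs
                changed = True
    return marked
```

Places outside the returned set can be discarded, together with every transition that has such a place among its inputs.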


Benchmarks. Due to the lack of tools handling reachability for unbounded 
state spaces, benchmarks arising in the literature are primarily coverability
instances^5, i.e. reachability towards an upward closed set of target markings. We
gathered 61 positive and 115 negative coverability instances originating from 
five suites [39,28,6,35,18] previously used for benchmarking [23,8,29]. They arise 
from the analysis of multi-threaded C programs with shared-memory; mutual 


5 The Model Checking Contest focuses on reachability for finite state spaces. 
exclusion algorithms; communication protocols; provenance analysis in the context
of a medical messaging and a bug-tracking system; and the verification of
ERLANG concurrent programs. We further extracted the sypet suite made of 30 
positive (standard) reachability instances arising from queries encountered in 
type-directed program synthesis [24]. The overall goal of this work is to enable 
a vast range of untapped applications requiring reachability over unbounded 
state-spaces, rather than just coverability. To obtain further (positive) instances 
of the Petri net reachability problem, we performed random walks on the Petri 
nets from the aforementioned coverability benchmarks. To this end, we used the 
largest quarter of distinct Petri nets from each coverability suite, for a total of 
33. We performed one random walk each of lengths 20, 25, 30, 35, 40, 50, 60, 
75, 90 and 100, and we saved the resulting marking as the target. For nets with 
an upward-closed initial marking, we randomly chose to start with a number of 
tokens between 1 and 20% of the length of the walk. It is important to note that 
even with long random walks, instances can (and in fact tend to) have short wit- 
nesses. To remove trivial instances and only keep the most challenging ones, we 
removed those instances where any considered tool reported a witness of length 
at most 20, disregarding the transitions used to generate the initial marking. 
This leaves us with 127 challenging instances on which the shortest witness is 
either unknown or has length more than 20. Moreover, this yields real-world 
Petri nets with no bias towards any specific kind of targets. 
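The generation of targets by random walks can be sketched as below. This is an illustrative Python fragment under assumed encodings (transitions as (guard, effect) tuples, a fixed seed for reproducibility); it does not reproduce the filtering of trivial instances nor the choice of initial token counts described above.

```python
import random

def random_walk_target(init, transitions, length, seed=0):
    """Fire up to `length` randomly chosen enabled transitions from `init`.

    Returns the final marking and the indices of the fired transitions; the
    walk stops early if no transition is enabled.
    """
    rng = random.Random(seed)
    marking, fired = tuple(init), []
    for _ in range(length):
        enabled = [
            i for i, (guard, _) in enumerate(transitions)
            if all(a >= g for a, g in zip(marking, guard))
        ]
        if not enabled:
            break
        i = rng.choice(enabled)
        marking = tuple(a + e for a, e in zip(marking, transitions[i][1]))
        fired.append(i)
    return marking, fired
```

The fired sequence is a witness for the returned target, but usually not a shortest one, which is why walk length alone does not make an instance hard.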
This table summarizes the characteristics of the various benchmarks:


                         Number of places           Number of transitions
Suite          Size    min.  med.  mean  max.     min.  med.  mean   max.

COVERABILITY     61      16    87   226  2826       14   181  1519  27370
SYPET            30      65   251   320  1199      537  2307  2646   8340
RANDOM WALKS    127      52   306   531  2826       60  3137  5885  27370


Tool comparison. To evaluate our approach on reachability instances, we compare
FASTFORWARD to LOLA [53], a tool that has been developed for two decades and wins
several categories of the Model Checking Contest every year. LOLA is geared towards
model checking of finite state spaces, but it implements semi-decision procedures
for the unbounded case. We further compare the three selection strategies
of Algorithm 1: A*, GBFS and Dijkstra; the first two with the distance
under-approximation scheme $D_{\mathbb{Q}}$, which provides the best trade-off between
estimate quality and efficiency. In fact, the other heuristics perform strictly worse
on almost all instances. We also considered comparing with KREACH [17], a tool
showcased at TACAS’20 that implements an exact non-elementary algorithm.
However, it timed out on all instances, even with a larger time limit of 10 minutes.
Figure 2 depicts the number of reachability instances decided by the tools
within the time limit. As shown, all approaches outperform LOLA, with GBFS
as the clear winner on the RANDOM-WALK suite and A* slightly better on the
SYPET suite. Note that Dijkstra’s selection strategy sometimes competes due
to its locally very cheap computational cost (no heuristic evaluation), but its
performance generally decreases as the distance increases.


Fig. 2. Cumulative number of reachability instances decided over time. Left: SYPET
suite (semi-log scale). Right: RANDOM-WALK suite (log scale).

To show the versatility of our approach, we also benchmarked FASTFORWARD
on the original coverability instances. Recall that coverability is EXPSPACE-complete
and reduces to reachability in linear time [45,51]. While exceeding the
PSPACE-completeness of reachability for finite state-spaces [38,21], coverability
is much more tame than the non-elementary complexity of (unbounded) reachability.
We compare FASTFORWARD to four tools implementing algorithms, some of which
are tailored specifically to the coverability problem: LOLA, BFC [39],
ICOVER [29] and the backward algorithm (based on [1]) of MIST [28]. We did not
test PETRINIZER [23] since it only handles negative instances, while we focus on
positive ones; likewise for QCOVER [8] since it is superseded by ICOVER.
Fig. 3. Cumulative number of (positive) coverability instances decided over time. Left:
Evaluation on the original instances. Right: Evaluation on the pre-pruned instances.


Figure 3 illustrates the number of coverability instances decided within the
time limit. The left side corresponds to an evaluation on the original instances
where FASTFORWARD performs pruning (included in its runtime). On the right-hand
side, the pruned instances are the input for all tools, and the time for this
pruning is not included for any tool. As a caveat, ICOVER performs its own
preprocessing which includes pruning among techniques specific to coverability. This
preprocessing is enabled (and its time is included) even when pruning is already
done. Using FASTFORWARD(A*, $D_{\mathbb{Q}}$), we decide more instances than all tools on
unpruned Petri nets, and one fewer than BFC for pre-pruned instances. It is worth
mentioning that with a time limit of 10 minutes per instance, FASTFORWARD(A*,
$D_{\mathbb{Q}}$) is the only tool to decide all 61 instances.
Fig. 4. Runtime comparison against FF(A*, $D_{\mathbb{Q}}$) (left) and FF(GBFS, $D_{\mathbb{Q}}$) (right), in
seconds, for individual instances without pre-pruning. Tools on the first column of each
side include coverability and reachability instances, while those on the second column
of each side include coverability only. Marks on the green lines denote timeouts (60 s).


We also compared the running time of A* and GBFS with $D_{\mathbb{Q}}$ to the other
tools and approaches. For each tool, we considered the type of instances it can 
handle: either reachability and coverability, or coverability only. Figure 4 depicts 
this comparison, where the base approach is faster for data points that lie in the 
upper-left half of the graph. The axes start at 0.1 second to avoid a comparison 
based on technical aspects such as the programming language. Yet, LOLA, BFC 
and MIST regularly solve instances faster than this, which speaks to their level 
of optimization. We can see that FASTFORWARD outperforms ICOVER, LOLA 
and MIST overall. We cannot compete with BFC in execution time as it is a 
highly optimized tool specifically tailored to only the coverability problem that 
can employ optimization techniques such as Karp-Miller trees that do not work 
for reachability queries. 


Length of the witnesses. Since our approach is also geared towards the
identification of short(est) reachability witnesses, we compared the different tools
with respect to the length of the reported witness, depicted in Figure 5. Positive values
on the y-axis mean the witness was not minimal, while y = 0 means it was.
Note that the points for BFC must be taken with a grain of salt: it uses a different
file format, and its translation utility can introduce additional transitions.
This means that even if BFC found a shortest witness of the translated instance,
it could be longer than a shortest one of the original instance.
Fig. 5. Length of the returned witness, per tool, compared to the length of a shortest
witness. ICOVER is left out as it does not return witnesses. FF(A*, $D_{\mathbb{Q}}$), FF(DIJKSTRA)
and MIST are left out as they are guaranteed to return shortest witnesses.


Still, the graph shows that reported witnesses can be far from minimal. For
example, on one instance LOLA returns a witness that is 53 transitions longer
than the one of FASTFORWARD(A*, $D_{\mathbb{Q}}$). Nevertheless, LOLA returns a shortest witness
on 28 out of 43 instances. Similarly, FASTFORWARD(GBFS, $D_{\mathbb{Q}}$) finds a shortest
path on 60 out of 83 instances^6. In contrast, MIST finds a shortest witness on
all instances since its backward algorithm is guaranteed to do so on unweighted
Petri nets, which constitute all of our instances. Again, this approach is tailored
to coverability and cannot be lifted to reachability.


Heuristics and pruning. We briefly discuss the quality of the heuristics and
the impact of pruning. The left-hand side of Figure 6 compares the exact distance
to the estimated distance from the initial marking^7. The estimate is
remarkably accurate for all $G$-distances, and even more so for $G = \mathbb{Q}_{\geq 0}$. We
experimented with this distance using the logical translation of [8] and Z3 [49] as
the optimization modulo theories solver. At present, it appears that the gain in
estimate quality does not compensate for the extra computational cost.

As depicted on the right-hand side of Figure 6, pruning can make some in- 
stances trivial, but in general, many challenging instances remain so. On average, 
around 50% of places and 40% of transitions were pruned. 


6 These numbers disregard instances where the tool did not finish or where a shortest 
witness is not known, i.e. no method guaranteeing one finished in time. 
7 Z3 reported two non-optimal solutions, which explains the two points above the line.
Fig. 6. Left: initial distance estimation compared to the exact distance (points closer
to the diagonal are better). Right: number of instances per percentage of places (left)
and transitions (right) removed by pruning (rounded to nearest multiple of 10).


6 Conclusion 


We presented an efficient approach to the Petri net reachability problem that 
uses state-space over-approximations as distance oracles in the classical graph 
traversal algorithms A* and greedy best-first search. Our experiments have shown 
that using the state equation over $\mathbb{Q}^T_{\geq 0}$ provides the best trade-off between
computational feasibility and the accuracy of the oracle. However, we expect that
further advances in optimization modulo theories solvers may enable employing 
stronger over-approximations such as continuous Petri nets in the future. 

Moreover, non-algebraic distance under-approximations also fit naturally in 
our framework, e.g. the syntactic distance of [55] and “a-graphs” of [24]. These 
are crude approximations with low computational cost. Our preliminary tests 
show that, although they could not compete with our distances, they can provide 
early speed-ups on instances with large branching factors. An interesting line of 
research consists in identifying cheap approximations with better estimates. 

We wish to emphasize that our approach to the reachability problem has the 
potential to also be naturally used for semi-deciding reachability in extensions of 
Petri nets with a recursively enumerable reachability problem, such as Petri nets 
with resets and transfers [3,19] as well as colored Petri nets [37]. These extensions 
have, for instance, been used for the generation of program loop invariants [54], 
the validation of business processes [59] and the verification of multi-threaded 
C and JAVA program skeletons with communication primitives [16,39]. Linear 
rational and integer arithmetic over-approximations for such extended Petri nets 
exist [12,9,34,31] and could smoothly be used inside our framework. 
Abstract. We present an approach to synthesize relational invariants
to prove equivalences between object-oriented programs. The approach 
bridges the gap between recursive data types and arrays that serve to rep- 
resent internal states. Our relational invariants are recursively-defined, 
and thus are valid for data structures of unbounded size. Based on intro- 
ducing recursion into the proofs by observing and lifting the constraints 
from joint methods of the two objects, our approach is fully automatic 
and can be seen as an algorithm for solving Constrained Horn Clauses 
(CHC) of a specific sort. It has been implemented on top of the SMT- 
based CHC solver ADTCHC and evaluated on a range of benchmarks. 


1 Introduction 


Relational verification is widely applicable during an iterative process of soft- 
ware development, when a high-level specification, a prototype implementation, 
or even an arbitrary previous version is compared to the current version and 
verified for the absence of newly introduced bugs. As software grows large, com- 
positionality becomes a crucial factor to achieve scalability of relational verifi- 
cation tasks: reasoning about pairs of entire programs is reduced to reasoning 
about pairs of modules or isolated components of code. Proofs found for one 
component can be reused while reasoning about another component, or even 
the system as a whole. Successful examples in large-scale verification projects
include a step-wise refinement in seL4 [30] and the integration of model checking 
to software development workflow in AWS C Common [11]. 

In this work, we represent relational verification problems over object-oriented 
programs as Constrained Horn Clauses (CHC). A CHC is an implication in first- 
order logic that involves a set of unknown predicates. For a system of CHCs, we 
wish to find an interpretation for all predicates that validates all implications. 
CHCs are used in various tasks appearing in verification, e.g., finding loop in- 
variants or function summaries. For relational verification, a system of CHCs 
can be constructed by pairing components of code of two versions in lockstep 
and supplying it with relational pre- and post-conditions [14, 39, 44, 53].
State-of-the-art tools for solving CHCs, e.g., [9,19,21,27,32], are based on Satisfiability
Modulo Theories (SMT), e.g., [40,47]. They are gradually becoming more robust, as
long as the programs under analysis do not have a mixed use of data structures.

Verification conditions of real-world problems involve data structures such 
as arrays and Algebraic Data Types (ADTs) of unknown size, expecting the 
proofs to capture (quantified or recursive) properties over countably infinite
sets of elements. Arrays are typically handled in loops and often require finding
universally-quantified loop invariants [21]. ADTs, such as lists, maps, and sets,
require reasoning by structural induction [47] and often rely on additional helper
lemmas which are difficult to synthesize automatically. For relational
verification tasks, where one program is over arrays, and another is over ADTs, the
solvers should likely reason over quantified formulas and induction at the same 
time, which is currently challenging for most of the automated tools. 

We propose a set of new algorithms for solving CHCs constructed by pairing 
programs over arrays and ADTs. Because we deal with object-oriented programs, 
the data structures might be accessed and modified in any given method, and our 
pairing is done for each method separately. Relational proofs are synthesized over 
the data structures — they describe a relation that holds while simultaneously 
traversing pairs of elements by any of the methods. Our key idea is that not all 
methods may be needed for the actual synthesis. In fact, our algorithm generates 
a candidate proof by bridging a single pair of methods and then validates/repairs 
it on all others. In essence, we observe how pairs of inputs (or pairs of outputs) 
change the states, guess a candidate relation between elements of states, and 
(dis-)prove it on all other methods using an SMT-based theorem prover. 

Our synthesis strategy is customized for different classes of benchmarks via
so-called recipes. We present two recipes for the list ADT that are applicable,
respectively, for (1) stacks and queues, and (2) sets, multisets, and maps. They
both discover nontrivial invariants that need a recursive interpretation. We
generate the base and recursive cases of this interpretation independently. The key
point in determining the relations is to automatically investigate how an input or
an output affects the state. Finally, we discover auxiliary lemmas that provide
additional properties about objects in isolation and help prove that the inferred
invariants are valid.

Importantly, in contrast to a more lightweight CHC setting over numeri- 
cal theories (and even arrays) that can rely on an SMT solver to validate its 
recursion-free solutions, the validation of our recursive solutions is conducted 
by structural induction. We thus rely on recent advances in SMT-based fully
automated theorem proving [55] that (since recently) supports arrays. The
experiments have shown that the approach is reasonably fast in practice. Our
contribution, while presented in the CHC context, can be lifted to the program
analysis context and implemented in a range of robust verification tools that are
designed to support compositionality [7,24].

The rest of the paper is structured as follows. A short outline on background 
and notation is given in Sect. 2. In Sect. 3, we give an overview of the approach. 
Then, Sect. 4 and Sect. 5 present our recipes. Finally, we give the evaluation 
details in Sect. 6, related work in Sect. 7, and conclude the paper in Sect. 8. 


2 Preliminaries 


An object $O = (St, Init, (Op_n)_{n \in [1,N]})$ is defined over internal states $St$, with
initialization $Init(s)$ denoting initial states $s$, and methods $Op_n$, also called
operations, for some identifier $n$ (which for simplicity is treated as a natural number
in some finite interval, but later sections liberally refer to $Op_n$ by their name).
Each operation $Op_n(in, s, s', out)$ defines transitions between a pair of states $s$
and $s'$ for a given input $in$, producing an output $out$. Moreover, each operation
has an associated precondition $pre_n(in, s)$, ranging over the input and pre-state.

In this paper, we take a syntactic approach by representing states as tuples of 
variables. Specifically, we assume that Init(s) and each operation Op(in, s,s’, out) 
is given as a predicate, i.e., as a characteristic formula, over the specified param- 
eters, that holds for initial states, respectively, when the program can take a 
particular transition. Such a formula can be obtained from the source code by 
symbolic execution, and we assume that effect of loops inside operations is cap- 
tured by quantified formulas, creation of which is an orthogonal problem. Hence, 
our approach is language agnostic. 

We assume that the programs under consideration are deterministic, and we
assume that $pre_n(in, s) \implies \exists s', out.\ Op_n(in, s, s', out)$. Note that for
deterministic programs, the existential quantifier in $\exists s', out.\ Op_n(in, s, s', out)$ can be
eliminated if $pre_n(in, s)$ holds, as $s'$ and $out$ are functionally determined by $in$ and $s$.

We aim at solving a relational verification problem over two objects and 
reduce it to inductive invariant inference over a composition of two objects. 


Definition 1. Two objects $A$ and $C$ are equivalent if there exists an inductive
invariant $R$ over a composition of these objects, which satisfies all clauses below.
It connects two states $St^A$ and $St^C$ before and after each pair of operations
$(Op^A_n, Op^C_n)_{n \in [1,N]}$.


initialization:
    $Init^A(as) \wedge Init^C(cs) \implies R(as, cs)$

consecution, for each $n \in [1, N]$:
    $R(as, cs) \wedge Op^A_n(in, as, as', out^A) \wedge Op^C_n(in, cs, cs', out^C) \implies R(as', cs')$

safety (applicability), for each $n \in [1, N]$:
    $R(as, cs) \wedge pre^A_n(in, as) \implies pre^C_n(in, cs)$

safety (outputs), for each $n \in [1, N]$:
    $R(as, cs) \wedge Op^A_n(in, as, as', out^A) \wedge Op^C_n(in, cs, cs', out^C) \implies out^A = out^C$
Implications in Def. 1 define a set of Constrained Horn Clauses (CHC) over
an uninterpreted relation symbol $R$. There are three types of constraints:
(1) initialization, (2) consecution, and (3) safety. The third, safety, reflects the actual
relational specification, i.e., the correspondence between the programs under
analysis, in terms of the user-visible variables, namely the input $in$, and the
respective outputs, $out^A$ and $out^C$. Here, safety is divided into applicability
(coincidence of preconditions) and equivalence of outputs, which together ensure that
the two programs are observationally equivalent. To prove that this equivalence
holds, one needs to infer a more complicated invariant $R$ over the internal state.
For this reason, we need the initialization and the consecution constraints: whatever
happens due to each operation, the invariant is maintained, and by safety, the
programs remain observationally equivalent indefinitely.


Problem Statement: We seek an interpretation of $R$ that satisfies all
constraints in Def. 1 simultaneously. This conventional formulation of a CHC task
lets us use any off-the-shelf CHC solver. However, the problem is undecidable
in general, thus no solver is guaranteed to handle our specific tasks. Furthermore,
existing solvers mainly support lightweight arithmetic theories, and a few
exceptions also support ADTs [27] and arrays [21,32]. To the best of our
knowledge, there is no CHC solver that supports ADTs and arrays at the same time,
and there is no CHC solver that synthesizes recursive solutions.

Context: The system of CHCs ensures that A and C can be substituted 
interchangeably in any calling context, and it is applicable to a wide range of 
techniques for formal program development. The focus on equivalence instead of 
subsumption is not essential for our work, and the presented approach works for 
the asymmetric case just the same. Specifically, Liskov and Wing’s substitution
principle [36] follows (precondition strengthening is reflected by the applicability
constraints from $pre^A$ to $pre^C$, and all postconditions with respect to the outputs
are equivalent). Data Refinement [15,25] follows similarly (Def. 1 characterizes
that $R$ is a forward simulation [37]). See Sect. 7 for more details.


3 Synthesis of Recursive Relational Invariants 


In this section, we present the fundamentals of the approach to synthesize recur- 
sive relational invariants for systems over arrays and ADTs that we instantiate 
and illustrate on examples in the subsequent sections. 


3.1 Overview 


Algorithm 1: Automated synthesis of recursive relational invariants
  Input: Objects $A = (as, Init^A, (Op^A_n)_{n \in N})$ and $C = (cs, Init^C, (Op^C_n)_{n \in N})$,
         where as, cs are the state variables, and xs is a list variable of as
  Output: relational invariant R between A and C
  1  $R(nil, cs) \leftarrow Init^A(as[xs := nil]) \land Init^C(cs)$;
  2  $\varphi_r \leftarrow true$;
  3  let y and ys be fresh variables;
  4  while true do
  5      $cs_r \leftarrow$ UPDATE$(Op^A_n, Op^C_n, as[xs := cons(y, ys)], cs)$ for some $n \in N$;
  6      $\varphi_r \leftarrow \varphi_r \land$ MATCH$(Op^A_m, Op^C_m, as[xs := cons(y, ys)], cs, cs_r)$ for some $m \in N$;
  7      $R(as[xs := cons(y, ys)], cs) \leftarrow \varphi_r \land R(as[xs := ys], cs_r)$;
  8      if VALIDATE(R, A, C) then return R;


Our approach is purely symbolic and fully automatic in both stages: generating
a candidate relational invariant, and proving it correct (i.e., validating). The key
insight is an analysis of the joint operations in the constraints of Def. 1. We follow
a strategy of introducing recursion into the interpretation based on ADTs, and
of aligning the base case to initialization and the recurrence conditions to joint
operations. In particular, a relational invariant R that bridges an algebraic list xs
and an array (with auxiliary variables, such as index) cs is defined recursively
over the structure of xs, which produces this general schema:

\[
R(xs, cs) \equiv
\begin{cases}
\varphi_0(cs) & \text{if } xs = nil \\
\exists cs_r .\; \varphi_r(y, ys, cs, cs_r) \land R(ys, cs_r) & \text{if } xs = cons(y, ys)
\end{cases}
\tag{1}
\]


This schema has two placeholders for constraints, $\varphi_0$ in the base case and $\varphi_r$
in the recursive case, that may refer to the variables in scope (as indicated
by their respective parameter lists). Moreover, we seek a Skolem function to
eliminate the existentially quantified state variable $cs_r$ in the recursive position.
Intuitively, the desired Skolem function captures the delta between two array
states that corresponds to the delta between xs and ys.
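To make the shape of Eq. (1) concrete, the following Python sketch builds R from its three ingredients. It is only an illustration under stated assumptions, not the authors' implementation: the names `make_invariant`, `phi_0`, `phi_r`, and `skolem` are hypothetical, algebraic lists are encoded as nested pairs with `None` for nil, and the array is modelled as a dictionary. The instantiation at the end uses the stack invariant of Eq. (2).

```python
from typing import Any, Callable, Optional, Tuple

# Encoding assumed by this sketch: nil is None, cons(y, ys) is the pair (y, ys).
AList = Optional[Tuple[Any, Any]]


def make_invariant(
    phi_0: Callable[[Any], bool],
    phi_r: Callable[[Any, AList, Any, Any], bool],
    skolem: Callable[[Any, AList, Any], Any],
) -> Callable[[AList, Any], bool]:
    """Build R(xs, cs) in the shape of Eq. (1).

    The existential quantifier over cs_r is replaced by the witness
    skolem(y, ys, cs).
    """

    def R(xs: AList, cs: Any) -> bool:
        if xs is None:                       # base case: xs = nil
            return phi_0(cs)
        y, ys = xs                           # recursive case: xs = cons(y, ys)
        cs_r = skolem(y, ys, cs)
        return phi_r(y, ys, cs, cs_r) and R(ys, cs_r)

    return R


# Instantiation with the stack invariant of Eq. (2), where cs = (n, a)
# and the array a is modelled as a dict from indices to values.
R_stack = make_invariant(
    phi_0=lambda cs: cs[0] == 0,
    phi_r=lambda y, ys, cs, cs_r: cs[0] > 0 and cs[1].get(cs[0] - 1) == y,
    skolem=lambda y, ys, cs: (cs[0] - 1, cs[1]),
)

xs = (3, (2, (1, None)))                     # cons(3, cons(2, cons(1, nil)))
a = {0: 1, 1: 2, 2: 3, 3: 99}                # cell 3 lies outside the stack
print(R_stack(xs, (3, a)))                   # True
print(R_stack(xs, (2, a)))                   # False
```

Passing the witness as a function rather than searching for $cs_r$ mirrors the role of the Skolem function: once it is fixed, evaluating R is a plain structural recursion over xs.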

Alg. 1 gives our top-level synthesis procedure for interpretations of R. It takes
as input two objects, A and C, where as and cs are tuples of variables that represent
their respective states. We refer to the primed versions of these state variables as
as’ and cs’, assuming that all as, cs, as’, and cs’ are distinct. The algorithm
works with algebraic lists specifically and thus as is assumed to have such a
component given by the state variable xs. We denote by as[xs := e] the updated
vector of variables such that xs is replaced in as by symbolic expression e.

The base case of the interpretation of R is straightforward (line 1): the
algorithm uses a predicate $Init^C$ and a predicate $Init^A$ in which the xs variable
is instantiated to nil. The inductive case of the interpretation of R is trickier
(line 7). Because several different operations that produce state, consume state,
or do nothing with a state are possible (see Def. 2 later in the section), some of
them might contribute to different parts of the interpretation being synthesized.
In particular, methods MATCH and UPDATE are responsible for generating a
body of R. They are instantiated differently for our two recipes in Sect. 4
(applicable for stacks and queues) and Sect. 5 (applicable for (multi)sets and maps).

The first method, UPDATE, synthesizes an updated symbolic state $cs_r$, a
tuple of symbolic expressions, to be used in the nested inductive call of R.
It can therefore be understood to compute a witness (or Skolem function) for the
existential quantifier in Eq. (1) as an expression of the remaining variables in
scope, y, ys, as, cs. The second method, MATCH, then collects constraints $\varphi_r$ from
suitable transitions w.r.t. this $cs_r$.

In a loop for each candidate interpretation of R, our algorithm runs an 
automated SMT-based theorem prover [55] to validate it (line 8). The algorithm 
can iterate several times and converges after a successful theorem-prover run. 

A noteworthy feature of our framework is that UPDATE and MATCH need
not be synchronized in pairs. Although $cs_r$ and the result of MATCH
are eventually combined and used in a single formula, the nondeterministic
nature of our synthesis procedure means that the two ingredients
may originate from potentially non-joint operations, thereby enlarging the search
space of possible relational invariants.
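The loop of Alg. 1 can be sketched in Python on the stack example of Fig. 1. This is a simplified illustration under several assumptions: the candidate pools `UPDATES` and `MATCHES` are hand-written stand-ins for the outputs of UPDATE and MATCH, all names are hypothetical, and VALIDATE is replaced by an exhaustive check of the simulation conditions over a small finite domain instead of an inductive proof. The conditions checked (agreeing preconditions, equal outputs, and preservation of R) are an assumed reading of Def. 1.

```python
from itertools import product

NIL = None          # nil; cons(y, ys) is encoded as the pair (y, ys)
VALUES = (0, 1)     # element domain used by the bounded check
CAP = 4             # number of array cells used by the bounded check

# Each operation maps (input, state) to (next_state, output),
# or to None when its precondition fails.
def list_push(i, xs):
    return (i, xs), None

def list_pop(_, xs):
    return None if xs is NIL else (xs[1], xs[0])

def arr_push(i, cs):
    n, a = cs
    return (n + 1, a[:n] + (i,) + a[n + 1:]), None

def arr_pop(_, cs):
    n, a = cs
    return None if n <= 0 else ((n - 1, a), a[n - 1])

A = {"init": NIL, "ops": (list_push, list_pop)}
C = {"init": (0, (0,) * CAP), "ops": (arr_push, arr_pop)}

def lists(max_len):
    """All lists over VALUES with at most max_len elements."""
    level = [NIL]
    yield NIL
    for _ in range(max_len):
        level = [(v, xs) for v in VALUES for xs in level]
        yield from level

def make_R(update, match):
    """Instantiate the schema of Eq. (1) for cs = (n, a)."""
    def R(xs, cs):
        if xs is NIL:
            return cs[0] == 0                 # base case from initialization
        y, ys = xs
        return match(y, cs) and R(ys, update(cs))
    return R

def validate(R):
    """Bounded stand-in for VALIDATE (not a proof)."""
    if not R(A["init"], C["init"]):
        return False
    arrays = list(product(VALUES, repeat=CAP))
    for xs, n, a in product(list(lists(CAP - 1)), range(CAP), arrays):
        cs = (n, a)
        if not R(xs, cs):
            continue                          # only R-related pairs matter
        for (op_a, op_c), i in product(zip(A["ops"], C["ops"]), VALUES):
            ra, rc = op_a(i, xs), op_c(i, cs)
            if (ra is None) != (rc is None):
                return False                  # preconditions disagree
            if ra is not None and (ra[1] != rc[1] or not R(ra[0], rc[0])):
                return False                  # outputs or successors disagree
    return True

UPDATES = {
    "n - 2": lambda cs: (cs[0] - 2, cs[1]),
    "n - 1": lambda cs: (cs[0] - 1, cs[1]),
}
MATCHES = {
    "true": lambda y, cs: True,
    "n > 0 and y = a[n - 1]": lambda y, cs: cs[0] > 0 and cs[1][cs[0] - 1] == y,
}

def synthesize():
    """Try candidate pairs until one validates, as in the loop of Alg. 1."""
    for (u, update), (m, match) in product(UPDATES.items(), MATCHES.items()):
        R = make_R(update, match)
        if validate(R):
            return u, m, R
    return None

print(synthesize()[:2])
```

Note that the update and the match are drawn independently, which reflects the remark that the two ingredients need not come from the same pair of operations. The weaker candidate with `true` as its constraint is rejected because it relates states whose pop outputs differ.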


3.2 Classifying Operations 


Our particular strategies for choosing ingredients for the inductive interpretation 
of R are based on the classification of the operations of the abstract object. 

We define a partial ordering “$\preceq$” on ADT states that connects constructors
discerned by the recurrence in R to the transitions of operations. With respect
to this ordering, we can for example recognize operations that leave the ADT
unchanged (“noops”, which play a special role in Sect. 5), operations that
“produce” constructors and thereby enlarge the internal state by additional elements,
and conversely operations that “consume” constructors. A natural choice for $\preceq$
is the reflexive closure of the subterm ordering, where $xs \preceq ys$ for lists specifies
that xs is a suffix of ys. In general, this ordering is a heuristic choice that can be
used to control the result of the synthesis for specific applications. A choice which
works well for our examples is that xs is a non-strict subsequence of ys.

The $\preceq$ ordering naturally extends to tuples of variables (and thus, states),
and lets us classify operations into the following three kinds.


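Definition 2. Let Op be an operation of an abstract object. Then,

\[
\begin{aligned}
\textsc{isNo}(Op) &\equiv \forall i, s, s', o .\; Op(i, s, s', o) \Longrightarrow s = s' \\
\textsc{isProd}(Op) &\equiv \forall i, s, s', o .\; Op(i, s, s', o) \Longrightarrow s \preceq s' \land \neg \textsc{isNo}(Op) \\
\textsc{isConsm}(Op) &\equiv \forall i, s, s', o .\; Op(i, s, s', o) \Longrightarrow s' \preceq s \land \neg \textsc{isNo}(Op)
\end{aligned}
\]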


Example 1. The class of an operation can often be identified by a cheap syntactic
check that recognizes when cons is applied to a current-state or a next-state variable.
In the upcoming stack example in Fig. 1, from xs’ = cons(in, xs) we have that
push is a producer operation, and from cons(out, xs’) = xs we classify pop as
a consumer operation. A top operation, not shown in Fig. 1, would be recognized
as a noop (see also hasElement in the upcoming example in Fig. 3).
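Since Def. 2 is semantic, a quick way to experiment with it is a bounded check over small states. The sketch below is an approximation for illustration only: it enumerates short lists instead of proving the universally quantified formulas, it uses the suffix ordering, and the names `classify`, `is_suffix`, and `reverse` are hypothetical. The `reverse` operation is added merely as an example that falls into none of the three classes.

```python
from itertools import product

NIL = None  # nil; cons(y, ys) is encoded as the pair (y, ys)


def is_suffix(xs, ys):
    """Ordering used here: xs is a (non-strict) suffix of ys."""
    while xs != ys:
        if ys is NIL:
            return False
        ys = ys[1]
    return True


def lists(values, max_len):
    """All lists over `values` with at most `max_len` elements."""
    level = [NIL]
    yield NIL
    for _ in range(max_len):
        level = [(v, xs) for v in values for xs in level]
        yield from level


def classify(op, inputs, states):
    """Bounded approximation of Def. 2.

    `op(i, s)` returns (s', out), or None if the precondition fails.
    """
    runs = []
    for i, s in product(inputs, states):
        result = op(i, s)
        if result is not None:
            runs.append((s, result[0]))
    if all(s == t for s, t in runs):
        return "noop"
    if all(is_suffix(s, t) for s, t in runs):
        return "producer"
    if all(is_suffix(t, s) for s, t in runs):
        return "consumer"
    return "unclassified"


def push(i, xs):
    return (i, xs), None

def pop(_, xs):
    return None if xs is NIL else (xs[1], xs[0])

def top(_, xs):
    return None if xs is NIL else (xs, xs[0])

def reverse(_, xs):
    out = NIL
    while xs is not NIL:
        out, xs = (xs[0], out), xs[1]
    return out, None


states = list(lists((0, 1), 3))
for op in (push, pop, top, reverse):
    print(op.__name__, classify(op, (0, 1), states))
```

Because the noop test runs first, the producer and consumer branches are only reached for operations that change the state on at least one run, which matches the negated isNo conjunct in Def. 2.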


In the next two sections, we introduce our particular strategies for the
implementations of UPDATE and MATCH of Alg. 1, in reference to Def. 2. Some
operations fall into none of the classes, or it may be hard to determine whether
they do, given that Def. 2 is semantic; and different operations may contribute
different ingredients for a correct definition of R. To make use of as many
operations as possible, we suggest strategies for all three classes of operations, to be
able to synthesize a relational invariant in complex cases, even when complete
information about the system is difficult to obtain.


4 Recipe 1: Linear Scan 


We identify a class of problems that require scanning the arrays in implementa- 
tions of stacks and queues linearly. A distinguishing feature in this class is the 
presence of a numeric variable in cs through which array cells are accessed (de- 
noted index in the rest of the section). We first illustrate the synthesis process 
on the following example and then present the algorithmic details. 


4.1 Motivating Example 


Two realizations of a LIFO stack are shown in Fig. 1: one is based on linked lists,
and another is based on arrays. They share a common interface of initialization
and the two operations push and pop. For example, the encodings of pop of
ListStack and ArrStack are respectively:


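\[
\begin{aligned}
Op^{ListStack}_{pop}(xs, xs', out) &\equiv (xs \neq nil \land xs' = xs.tail \land out = xs.head) \\
&\equiv (xs = cons(out, xs')) \quad \text{(after simplification)} \\
Op^{ArrStack}_{pop}(a, n, a', n', out) &\equiv (n > 0 \land a' = a \land n' = n - 1 \land out = a[n'])
\end{aligned}
\]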


where $xs \neq nil$ and $n > 0$ are the preconditions, and out captures the return
value. As an illustration, formula $Op^{ListStack}_{pop}(s, \_, 7)$ holds for all states s in
which pop terminates and returns 7 (by convention we use _ to denote terms
that are irrelevant in a particular context). Note also that in the implementation
of ArrStack, the popped value is not erased from the array: for a[n]
to be considered in the future, it has to be overwritten by some push operation. In
general, the array always contains infinitely many unknown values outside the
range of cells a[0], ..., a[n - 1], which are never accessed.

A possible relational invariant R(xs,n,a) bridging ListStack and ArrStack 
is defined as follows: 


\[
R(xs, n, a) \equiv
\begin{cases}
n = 0 & \text{if } xs = nil \\
n > 0 \land y = a[n - 1] \land R(ys, n - 1, a) & \text{if } xs = cons(y, ys)
\end{cases}
\tag{2}
\]


Intuitively, this R captures that a list xs has the same content as the portion of
an array a between indexes 0 (inclusive) and n (exclusive). When xs is empty,
the portion of a should be empty too, thus n = 0. Otherwise, if xs is created
by cons-ing an element y onto some other list ys, then (1) n should be strictly
positive, and (2) y should belong to the designated portion of a.


class ListStack:
    def init():
        xs = nil
    def push(in):
        xs = cons(in, xs)
    def pop():
        assert xs != nil
        out = xs.head
        xs = xs.tail
        return out

class ArrStack:
    def init():
        n = 0
        a = ...
    def push(in):
        a[n] = in
        n = n + 1
    def pop():
        assert n > 0
        n = n - 1
        return a[n]


Fig. 1: Two implementations of a LIFO stack.


  cons(y, ys) --Op^A(in, out)--> ys             ys   --Op^A(in, out)--> cons(y, ys)
       |                          |              |                           |
       R                          R              R                           R
       |                          |              |                           |
       cs     --Op^C(in, out)--> cs_r           cs_r --Op^C(in, out)--> cs


Fig. 2: Transitions of consumer operations (left) and producer operations (right) used
to instantiate Eq. (1).


The schema in Sect. 3.1 has two placeholders for constraints, $\varphi_0$ in the base
case and $\varphi_r$ in the recursive case, that may refer to the variables in scope (as
indicated by their respective parameter lists). Moreover, we seek a state $cs_r$ in
the recursive position. Placeholder $\varphi_0$ is instantiated by constraints from the
initialization operations, such as n = 0 from ArrStack. This alignment of base case
and initialization is not just a coincidence: many data structures start out
empty and are gradually populated by calling operations (e.g., collections).


The purpose of $\varphi_r$ in the recursive case of Eq. (1) is twofold. First, it connects
a portion of the ADT state (specifically y) to the array state cs, in the example
via a[n - 1] = y. Second, it determines a suitable array state $cs_r$ as an argument of
the recursive occurrence of R. For instance, we take n - 1 for the recursive call
but leave a unchanged. This is motivated by the observation that a state where
xs = cons(y, ys) for some y, ys is consumed by pop. Using this information, the
recurrence of R must align with the corresponding array transitions, too, as
shown in Fig. 2 on the left. The constraint n > 0 is the precondition of the array
operation, whereas y = a[n - 1] follows from comparing the outputs. As shown
in Fig. 2 on the right, we can dually base the recurrence on push, which produces
a cons, i.e., a transition from ys to xs = cons(y, ys) for some y. In this case, both
transitions need to be viewed in reverse such that the respective successor states
of push now match the left side R(xs, cs) of the schema. Then, the assignment
n = n + 1 can be rewritten to yield the equation $n_r = n - 1$.


Algorithm 2: UPDATE (recipe 1)
  Input: Operations $Op^A$ and $Op^C$,
         as[xs := cons(y, ys)] the shape of the state of A,
         cs the state variables of C, assuming cs = (_, index, a) where index
         and a are variables of integer and array types, resp.
  Output: Updated arguments $cs_r$
  1  if $\textsc{isProd}(Op^A)$ then
  2      let $cs_r$ = (_, index’, a’) be s.t. $\forall in, \exists out .\; Op^C(in, cs_r, cs, out)$;
  3      return (_, index’, a);
  4  if $\textsc{isConsm}(Op^A)$ then
  5      let $cs_r$ = (_, index’, a’) be s.t. $\forall in, \exists out .\; Op^C(in, cs, cs_r, out)$;
  6      return (_, index’, a);


Algorithm 3: MATCH (recipe 1)
  Input: Operations $Op^A$ and $Op^C$,
         as[xs := cons(y, ys)] the shape of the state of A,
         cs the state variables of C,
         $cs_r$ the updated state of C, assuming $cs_r$ = (_, index’, a) where index’
         and a are variables of integer and array types, resp.
  Output: Formula $\varphi_r$
  1  if $\textsc{isProd}(Op^A)$ then
  2      inv $\leftarrow$ GETLOOPINVARIANT(index’, $Op^C$);
  3      return $inv \land \neg Init^C(cs) \land y = a[index']$;
  4  if $\textsc{isConsm}(Op^A)$ then
  5      return $pre^A \land pre^C \land y = a[index']$;
  6  return true;


To make this intuition practical, our approach suggests a particular strat- 
egy for picking operations to take constraints from, recognizing consumers and 
producers more generally, and validating the guessed relational invariants using 
induction and lemmas. 


4.2 Algorithm Description 


Alg. 2 and Alg. 3 show the implementations of UPDATE and MATCH, respectively,
that suit stacks and queues. Recall that these algorithms are called from Alg. 1
and take as input pairs of nondeterministically chosen joint operations of A
and C; state variables cs of C; the current version of state variables $cs_r$ to be used
in the recursive call of R; and fresh variables y and ys introduced in Alg. 1 to
define the inductive rule of R. Outputs of UPDATE and MATCH are respectively
an updated tuple of variables $cs_r$ and a subformula $\varphi_r$ to be conjoined with the
inductive definition of R.

If the producing operation is picked (line 1 of Alg. 2), then we have to find a
term index’ that would be transitioned by $Op^C$ to index. In particular,
after assigning a new value to an array cell, index is monotonically updated (i.e.,
incremented like in the example in Fig. 1, or decremented). Thus, to access the
array cell containing a new value using an updated value of index, we have to
invert the arithmetic operation and obtain index - 1 (for Fig. 1) or index + 1 (in
the case of decrementation). Technically, in Alg. 2, this is realized by taking the
index variable from cs, through which cells of the array can be observed (e.g., n
in the example in Fig. 1), and finding a term index’ that would be transitioned
by $Op^C$ to index. Thus, the resulting $cs_r$ is composed from the same ingredients
as cs where index’ replaces index.


If the consuming operation is picked (line 4), then we proceed in the reverse
direction and find index’ that is the result of transitioning index through $Op^C$.
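The inversion step for producers can be illustrated with a small Python sketch. It assumes, for illustration only, that the index update of the producer is a shift by a small constant, and it searches for that shift on sample values instead of solving the constraint of Alg. 2 symbolically. The function names are hypothetical.

```python
def invert_index_update(step, offsets=range(-3, 4), samples=range(-10, 11)):
    """Producer case: find d such that step(index + d) == index on all samples.

    The result describes index' = index + d, i.e., the term that the
    producer transitions to index. Returns None if no shift fits.
    """
    for d in offsets:
        if all(step(n + d) == n for n in samples):
            return d
    return None


def forward_index_update(step, index):
    """Consumer case: index' is obtained by running the update on index."""
    return step(index)


# push in Fig. 1 increments n, so the recursive call uses n - 1.
print(invert_index_update(lambda n: n + 1))   # -1
# A variant that writes to even cells only shifts by 2.
print(invert_index_update(lambda n: n + 2))   # -2
# pop in Fig. 1 decrements n, which yields n - 1 directly.
print(forward_index_update(lambda n: n - 1, 5))  # 4
```

Both directions arrive at the same term n - 1 for the stack of Fig. 1, which is why the producer and the consumer can be used interchangeably to compute $cs_r$ in this example.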


Alg. 3 for this recipe relies on the output of Alg. 2. Interestingly, it is
supported even if $cs_r$ is computed using the producer, but $\varphi_r$ in Alg. 3 is computed
using the consumer. Our particular strategy for the consumers in this recipe is 1)
to use the precondition for $Op^C$, and 2) to bridge the outputs of $Op^A$ and $Op^C$
via an equality. The inference via the producer in line 1, in comparison,
misses an important constraint in the example, as the precondition of push is trivial.
Such a situation can be mitigated by discovering a loop invariant (line 2)
over index (usually within Linear Integer Arithmetic (LIA)), adding it,
and blocking the initial state (to distinguish from the base case of the definition
of R) in the inductive case of the interpretation of R being synthesized. Loop
invariants are generated as interpretations of a predicate inv satisfying
the following two implications:

\[
Init^C(cs) \Longrightarrow inv(cs)
\]

\[
inv(cs) \land \Big( \bigvee_{n \in N} Op^C_n(in, cs, cs', out) \Big) \Longrightarrow inv(cs')
\]


Note that these CHCs (over LIA) can be solved by numerous existing ap- 
proaches. Without a query, ideally the strongest loop invariant is desirable; how- 
ever in practice it suffices to apply lightweight techniques based on forward- 
propagation of initial states using quantifier elimination, followed by its inductive 
subset computation [20]. This often finds an adequately-strong invariant. 
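The idea of keeping an inductive subset of candidate facts can be sketched as follows. This is a Houdini-style filter over a bounded integer domain, used here as a stand-in for the quantifier-elimination-based technique cited above; the candidate pool, the domain bounds, and the name `inductive_subset` are assumptions of the sketch.

```python
def inductive_subset(init, transitions, candidates, domain):
    """Keep the largest subset of `candidates` that is inductive on `domain`.

    `init` lists the initial index values, each transition maps an index to
    the list of its successors, and `candidates` maps names to predicates.
    """
    # Keep only candidates that hold in all initial states.
    kept = {name: p for name, p in candidates.items()
            if all(p(s) for s in init)}
    while True:
        def inv(s):
            return all(p(s) for p in kept.values())
        # A candidate is broken if some step from an inv-state violates it.
        broken = [name for name, p in kept.items()
                  if not all(p(t)
                             for s in domain if inv(s)
                             for step in transitions
                             for t in step(s))]
        if not broken:
            return kept
        for name in broken:
            del kept[name]


# Index updates of ArrStack in Fig. 1: push increments n, pop needs n > 0.
push = lambda n: [n + 1]
pop = lambda n: [n - 1] if n > 0 else []

candidates = {
    "n >= 0": lambda n: n >= 0,
    "n <= 5": lambda n: n <= 5,
    "n == 0": lambda n: n == 0,
}
result = inductive_subset([0], [push, pop], candidates, range(-8, 9))
print(sorted(result))   # ['n >= 0']
```

All three candidates hold initially, but only n >= 0 survives the consecution check, and it is the kind of constraint that, once the initial state is blocked, contributes n > 0 to the inductive case.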


Example 2. Recall the stack example in Fig. 1. Let the index’ term be computed
by Alg. 2 via inverting the increment operation in push. It is then used as an
argument of the nested call to R in the inductive case of the definition of R.
By construction, the a[index’] cell contains the value of in, i.e., the argument of
push. At the same time, in is the argument of cons in $Op^A$ representing push,
which lets us bridge the array and ADT in the proof. To allow this, Alg. 3
takes argument y of cons from the inductive definition of R, and equates it with
a[index’], producing y = a[n - 1]. Combining it all together, we get the final
solution, as shown in (2).


class ListSet:
    def init():
        xs = nil
    def hasElement(in):
        return contains(xs, in)
    def insert(in):
        xs = cons(in, xs)
    def erase(in):
        xs = removeall(xs, in)

class ArraySet:
    def init():
        a = [false, false, ...]
    def hasElement(in):
        return a[in]
    def insert(in):
        a[in] = true
    def erase(in):
        a[in] = false


Fig. 3: Two implementations of a set, where the list is not necessarily duplicate-free.


5 Recipe 2: Noop-based synthesis 


In this section we present a recipe that suits sets, multisets, and maps, which
are in some sense non-linear. That is, these data structures do not maintain any index
variable of the kind usually used to access elements. Instead, arrays are viewed as
maps, and the corresponding ADTs are equipped with recursive functions that
traverse the data structure over and over again for each input. Oftentimes, these
objects have noop operations, and our synthesis procedure makes use of them.


5.1 Motivating Example 


Fig. 3 shows two implementations of a set. The list-based implementation stores
elements in the order of their insertions. The elements are not removed unless
erase is called explicitly. Thus, duplicate entries of the same element are
allowed. The implementation uses the recursive contains and removeall functions
that both traverse the list and search for a specific element:


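\[
contains(xs, a) =
\begin{cases}
false & \text{if } xs = nil \\
(a = y) \lor contains(ys, a) & \text{if } xs = cons(y, ys)
\end{cases}
\]

\[
removeall(xs, a) =
\begin{cases}
nil & \text{if } xs = nil \\
ite(a = y,\; removeall(ys, a),\; cons(y, removeall(ys, a))) & \text{if } xs = cons(y, ys)
\end{cases}
\]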
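For readers who prefer executable definitions, the two recursive functions can be transcribed directly into Python. The encoding of lists as nested pairs with `None` for nil is an assumption of this sketch.

```python
NIL = None  # nil; cons(y, ys) is encoded as the pair (y, ys)


def contains(xs, a):
    """True iff a occurs somewhere in the list xs."""
    if xs is NIL:
        return False
    y, ys = xs
    return a == y or contains(ys, a)


def removeall(xs, a):
    """Copy of xs with every occurrence of a removed."""
    if xs is NIL:
        return NIL
    y, ys = xs
    rest = removeall(ys, a)
    return rest if a == y else (y, rest)


xs = (1, (2, (1, NIL)))          # cons(1, cons(2, cons(1, nil)))
print(contains(xs, 2))           # True
print(removeall(xs, 1))          # (2, None)
```

The duplicate entry 1 in `xs` shows why erase has to call removeall rather than delete only the first match: a single removal would leave the element observable through hasElement.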


The array-based implementation handles a map a from elements to Booleans. 
Initially, all cells in a are false. Inserting and removing an element is implemented 
by storing true and false to the corresponding cell respectively. The difficulty 
here is to support the shown implementation of insert and erase in Fig. 3, as 
well as possible variants that e.g., eagerly prune duplicate entries in the list-based 
implementation (see Sect. 6). 

The expected output of our synthesis procedure is as follows: 


\[
R(xs, a) \equiv
\begin{cases}
\forall z .\; \neg a[z] & \text{if } xs = nil \\
a[y] \land R(ys, a[y := contains(ys, y)]) & \text{if } xs = cons(y, ys)
\end{cases}
\tag{3}
\]


Algorithm 4: UPDATE (recipe 2)
  Input: Operations $Op^A$ and $Op^C$ such that $\textsc{isNo}(Op^A)$ holds,
         as[xs := cons(y, ys)] the shape of the state of A,
         cs the state variables of C
  Output: Updated arguments $cs_r$
  1  let cs’ be fresh variables;
  2  $\varphi \leftarrow Op^A(y, as[xs := ys], as[xs := ys], out) \land Op^C(y, cs', \_, out)$;
  3  $\psi \leftarrow \forall z .\; z \neq y \Longrightarrow \exists out' .\; Op^C(z, cs, \_, out') \land Op^C(z, cs', \_, out')$;
  4  assume $QE(\exists out .\; \varphi \land \psi)$ simplifies to $(cs' = cs_r)$;
  5  return $cs_r$;


Algorithm 5: MATCH (recipe 2)
  Input: Operations $Op^A$ and $Op^C$ such that $\textsc{isNo}(Op^A)$ holds,
         as[xs := cons(y, ys)] the shape of the state of A, denoted $as_0$ below,
         cs the state variables of C,
         $cs_r$ the updated state of C
  Output: Formula $\varphi_r$
  1  $\varphi \leftarrow Op^A(y, as_0, as_0, out) \land Op^C(y, cs, cs_r, out)$;
  2  return SIMPLIFY$(QE(\exists out .\; \varphi))$;
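The expected invariant of Eq. (3) can be exercised on small instances. The following Python sketch evaluates R for the set example of Fig. 3 and checks, over a small finite domain, that hasElement agrees on R-related states and that insert and erase preserve R. It is a bounded sanity check under stated assumptions, not a proof: the array is modelled as the frozenset of keys mapped to true, the element domain is {0, 1, 2}, and all function names are hypothetical.

```python
from itertools import product

NIL = None            # nil; cons(y, ys) is encoded as the pair (y, ys)
DOMAIN = (0, 1, 2)    # element domain used by the bounded check


def contains(xs, a):
    return False if xs is NIL else (a == xs[0] or contains(xs[1], a))


def removeall(xs, a):
    if xs is NIL:
        return NIL
    rest = removeall(xs[1], a)
    return rest if a == xs[0] else (xs[0], rest)


def store(a, k, v):
    """a[k := v] for arrays encoded as frozensets of the keys mapped to true."""
    return a | {k} if v else a - {k}


def R(xs, a):
    """Relational invariant in the shape of Eq. (3)."""
    if xs is NIL:
        return not a                                  # forall z. not a[z]
    y, ys = xs
    return y in a and R(ys, store(a, y, contains(ys, y)))


def lists(max_len):
    level = [NIL]
    yield NIL
    for _ in range(max_len):
        level = [(v, xs) for v in DOMAIN for xs in level]
        yield from level


def check_simulation(max_len=3):
    """Check the three operations of Fig. 3 on all R-related small states."""
    arrays = [frozenset(k for k, bit in zip(DOMAIN, bits) if bit)
              for bits in product((False, True), repeat=len(DOMAIN))]
    for xs, a in product(list(lists(max_len)), arrays):
        if not R(xs, a):
            continue
        for i in DOMAIN:
            if contains(xs, i) != (i in a):                   # hasElement
                return False
            if not R((i, xs), store(a, i, True)):             # insert
                return False
            if not R(removeall(xs, i), store(a, i, False)):   # erase
                return False
    return True


print(R((1, (2, (1, NIL))), frozenset({1, 2})))   # True
print(check_simulation())                          # True
```

The recursive call resets a[y] to contains(ys, y), so a duplicate of y further down the list keeps the cell set to true, which is what makes the invariant work for lists with repeated elements.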


5.2 Algorithm details 


Alg. 4 and Alg. 5 show the implementations of UPDATE and MATCH, respectively,
for this recipe. The arguments $cs_r$ of the nested call to R in the inductive case of
the definition of R are computed in Alg. 4 using the symbolic encoding of a noop.
In the set example, the noop is the hasElement operation, which allows observing
the status of the internal state and does not modify it. We furthermore assume
that the input type of $Op^A$ coincides with the type of elements stored in the list, i.e.,
it is meaningful to call $Op^A(y, \ldots)$ with the list head y from the recursive case
of (1) where xs = cons(y, ys).

The key idea behind Alg. 4 is to make the necessary adjustments to cs to
construct $cs_r$ that mirror any changes that can be observed via $Op^A$ when
transitioning from list xs to ys in (1). This update is determined in terms of
auxiliary variables cs’ that are constrained to satisfy certain input/output pairs
for the corresponding $Op^C$, by case analysis on whether the input is the
particular y that is removed by the recurrence. The primary intention is to reassign
a[y] appropriately. We do this by collecting constraints $\varphi$ such that the output
observed for $Op^C$ for y and cs’ matches that of the corresponding $Op^A$ on the
smaller state with ys. This is also the key difference to Sect. 4, where we
heuristically keep a unchanged in the recursive call in (1). The outputs for all other
inputs z, however, are enforced to be unchanged w.r.t. the original cs, which is
expressed by the constraint $\psi$. We then eliminate the quantifier for out (which
is straightforward as the operations are deterministic) and rewrite the formula
to closed expressions $cs_r$ for variables cs’ as the result.

Example 3. Specifically for the example in Sect. 5.1, the algorithm proceeds by
symbolic execution of hasElement, yielding the following constituents:

\[
\begin{aligned}
Op^A &\equiv (out = contains(ys, y)) \\
Op^C &\equiv (out = a[y]) \\
\varphi &\equiv (out = contains(ys, y) \land out = a'[y]) \\
\psi &\equiv (\forall z .\; y \neq z \Longrightarrow \exists out' .\; out' = a'[z] \land out' = a[z])
\end{aligned}
\]

The result $\exists out .\; \varphi \land \psi$ of Alg. 4 is now solved for a’. The only free variables refer
to the states of the systems. Bound variables out and out’ can be eliminated by
merging equalities over out and out’:

\[
a'[y] = contains(ys, y) \land (\forall z .\; y \neq z \Longrightarrow a'[z] = a[z])
\]

The first conjunct therefore provides the update for a’[y], whereas the second
conjunct states that a’ should agree with a at all indices other than y.
After applying the axioms of the theory of arrays we get as a result the following
equality, which pattern matches the expected shape in line 4:

\[
QE(\exists out .\; \varphi \land \psi) \equiv (a' = a[y := contains(ys, y)])
\]

This transformation requires us to “reverse-apply” the axiom of extensionality,
i.e., to switch from the pointwise comparison of a and a’ to an equality between
the entire arrays. Note that while in general quantifier elimination is difficult,
our current implementation has limited, but often sufficient, support that can
be extended by supplying rules to the underlying SMT-based theorem prover.

While Alg. 4 predicts future outputs of $Op^A$ for input y, Alg. 5
executes $Op^A$ on the state where xs = cons(y, ys) to obtain the current output of
$Op^A$ for the same y. The generated constraint simply expresses that the output
of $Op^C$ has to match. For hasElement we obtain the following formula:

\[
\exists out .\; (contains(cons(y, ys), y) = out) \land (a[y] = out)
\]

Unfolding the definition of contains and simplification produces true = a[y],
which is then used as the “body” of the inductive case of R in (3).
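The two steps of Example 3 can be replayed by brute force on a small finite domain. The sketch below enumerates all Boolean arrays over three keys instead of performing quantifier elimination, so it only illustrates that the constraints have a unique solution equal to the array store and that the matching formula collapses to a[y]. The names `solve_update`, `match`, and `store`, the dictionary encoding of arrays, and the domain are assumptions of the sketch.

```python
from itertools import product

NIL = None            # nil; cons(y, ys) is encoded as the pair (y, ys)
DOMAIN = (0, 1, 2)    # keys of the Boolean arrays in this sketch


def contains(xs, a):
    return False if xs is NIL else (a == xs[0] or contains(xs[1], a))


def all_arrays():
    """Every total Boolean array over DOMAIN, as a dict."""
    for bits in product((False, True), repeat=len(DOMAIN)):
        yield dict(zip(DOMAIN, bits))


def solve_update(a, y, ys):
    """All a' with a'[y] = contains(ys, y) and a'[z] = a[z] for z != y."""
    return [a2 for a2 in all_arrays()
            if a2[y] == contains(ys, y)
            and all(a2[z] == a[z] for z in DOMAIN if z != y)]


def store(a, k, v):
    """The array a[k := v]."""
    return {**a, k: v}


def match(a, y, ys):
    """exists out. contains(cons(y, ys), y) = out and a[y] = out."""
    return any(contains((y, ys), y) == out and a[y] == out
               for out in (False, True))


a = {0: True, 1: True, 2: False}
ys = (0, NIL)
print(solve_update(a, 1, ys))      # [{0: True, 1: False, 2: False}]
print(match(a, 1, ys) == a[1])     # True
```

The single solution returned by `solve_update` is the pointwise description of `store(a, y, contains(ys, y))`, which corresponds to the switch from a pointwise comparison to an equality between whole arrays.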


6 Evaluation 


We have implemented the approach in a prototype CHC solver called ADTCHC³,
relying on ADTIND [55] as an inductive prover, which in turn uses the Z3 [40]
SMT solver to quickly perform the satisfiability checks over uninterpreted
functions and linear arithmetic that are needed at various solving stages. ADTCHC
automatically determines the appropriate synthesis recipe by analyzing the
syntax of the program (i.e., presence of index variables) and is able to successfully
find relational invariants and prove them valid for all considered benchmarks.

³ The tool and benchmarks are available at https://github.com/grigoryfedyukovich/aeval/tree/adt-chc.

We have evaluated the approach from Sect. 3 on different realizations of 
text-book data structures. The evaluation aims at answering two questions. Is 
the approach effective in the first place to discover suitable relational invariants, 
and how well can the necessary induction proofs be automated? The latter is 
relevant since Alg. 1 crucially depends on VALIDATE in its refinement loop. 

All our benchmarks require recursive invariants. They fall into two cate- 
gories. First, stacks and queues from Sect. 4 (with variations that store values 
only to even indexes of the array) are solved based on linear scan. Second, 
sets, multisets, and maps, (that differ in whether, e.g., duplicate elements are 
stored in the respective lists) are solved with the approach in Sect. 5. We in- 
clude such variations to reflect different trade-offs when designing specifications, 
and to demonstrate that our technique is reasonably flexible. The only user- 
provided lemma was required for the multiset benchmark (marked * in Table 1): 
\[
\forall a, xs .\; num(a, xs) = 0 \Longrightarrow remove(a, xs) = xs
\]
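The lemma can be sanity-checked on small lists. The definitions of num and remove are not shown in this text, so the sketch below assumes that num counts the occurrences of an element and that remove deletes its first occurrence; both are guesses made for illustration, and the check is a bounded test rather than a proof.

```python
NIL = None  # nil; cons(y, ys) is encoded as the pair (y, ys)


def num(a, xs):
    """Number of occurrences of a in xs (assumed meaning of num)."""
    count = 0
    while xs is not NIL:
        count += xs[0] == a
        xs = xs[1]
    return count


def remove(a, xs):
    """xs without the first occurrence of a (assumed meaning of remove)."""
    if xs is NIL:
        return NIL
    y, ys = xs
    return ys if y == a else (y, remove(a, ys))


def lists(values, max_len):
    level = [NIL]
    yield NIL
    for _ in range(max_len):
        level = [(v, xs) for v in values for xs in level]
        yield from level


def lemma_holds(values=(0, 1, 2), max_len=4):
    """Check num(a, xs) = 0 ==> remove(a, xs) = xs on all small instances."""
    return all(remove(a, xs) == xs
               for xs in lists(values, max_len)
               for a in values
               if num(a, xs) == 0)


print(lemma_holds())   # True
```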


The results from the evaluation⁴ of both groups of benchmarks (resp., recipes
used) are shown in Table 1. The choice of which recipe to use was made by the tool
itself at synthesis time. Total time (in seconds wall-clock) is entirely dominated
by proof search in ADTIND, and includes the time for SMT queries. We remark
that the time to synthesize the relational invariant is negligible in comparison to
the proof time (and the proof time is often proportional to the number of
internal SMT calls).

Table 1: Invariant synthesis timings.

Benchmark   Variant         Time (s)
Stack       Fig. 1          2.81
Stack       even cells      2.79
Queue       ordinary        40.61
Queue       even cells      42.18
Set         Fig. 3          2.12
Set         no duplicates   19.24
Multiset*   with remove     32.62
Multiset    with clear      3.59
Map         duplicates      1.95
Map         no duplicates   5.83

Most proofs are found using the default proof strategy (the same for every
benchmark) within 20 s. The longer running times of the remaining benchmarks
are caused by the large proof search space created
by a combination of array simplification and forward rewriting. We have also
tested our tool on buggy implementations, e.g., in which the consumer
operations are correct (and can be used for correct guesses of relational invariants),
but producers are not. As expected, the tool is unable to synthesize a relational
invariant for the whole system in these cases.

We have already presented the relational invariant found for the stack in (2).
For the stack variant that stores to even array indices only, counter n is
decreased by 2 instead of 1 in the recursive call, as expected. The relational
invariant R(xs, m, n, a) for the queue benchmarks keeps two indices into the array a.
Depending on the variant, the first element of the list xs is found at a[m] or a[n],
and the recursion either increases m or decreases n. The relational invariants
for the multiset and map examples are analogous. All other necessary lemmas are
automatically discovered and proved by ADTIND, for example, for the set
benchmarks:

\[
\forall xs, s, x .\; R(xs, s) \Longrightarrow contains(x, xs) = s[x]
\]

⁴ The evaluation was conducted on a MacBook Pro, Processor: 2 GHz Intel Core i5,
Memory: 8 GB 1867 MHz LPDDR3, macOS v10.14.6.


7 Related Work 


Although there exist automated techniques to synthesize relational invariants,
none has been proposed to deal simultaneously with ADTs and arrays.
Conceptually, our approach is related to SIMABS, an SMT-based algorithm for simulation
synthesis [18]. SIMABS exploits a space of possible simulations and (dis-)proves
them using an off-the-shelf decision procedure. Guesses for simulation relations
are also obtained from the source code, by matching variables from two
programs. Alternatively, simulation relations can be inferred from test runs [49] or
through translation validation [41]. Our approach allows dealing with objects
(not just imperative code) and contributes several novel strategies for guessing
and proving non-trivial simulation relations.

Discovery of invariants to relate the behaviors of two programs or other ways
of establishing program equivalence is an active research area [5,14,22,23,39,44,51].
These approaches typically reduce the relational verification problem to a
safety verification problem and rely on existing tools, often solvers for
constrained Horn clauses (CHC). Currently, since ADTs and arrays are challenging
for the underlying solvers, the applicability of these approaches to our tasks is
also limited. There are decision procedures for abstraction of ADTs to lists, sets,
and multisets [52]; however, these apply to certain predefined abstractions only.

Our approach can be seen as an application of Syntax-Guided Synthesis
(SyGuS) [2]. Strategies dependent on types of benchmarks essentially represent sets
of syntactic templates filled iteratively and checked using an SMT solver. SyGuS
is also used successfully in CHC solving [19,21] and in lemma synthesis [46,47,55].
There are only a few approaches [21,28,31,55] that apply SyGuS to synthesize
formulas over ADTs or arrays/quantifiers. Data-driven approaches, e.g., [38], are
complementary to such syntax-based approaches. Neither deals with arrays,
quantifiers, and ADTs at the same time.

Unno et al. [53] support recursive predicates, by taking the least solution 
of initialization and consecution as the definition of R, however, this may lead 
to rather cumbersome inductive cases (e.g., for pop in the stack). We avoid 
the problem by basing the recurrence scheme on the data structure, and infer 
constraints that are well aligned to that scheme from the operations. Jennisys [34] 
tackles the related problem of generating recursive implementations from an 
abstract model, where the simulation relation is given. 

More generally, the problem addressed in this work relates to the idea of
step-wise refinement, originally conceived by [16] and [54] as a guideline to
organize software development and later studied extensively in a formal setting for
rigorous assessment of functional correctness (e.g., [1,4,15,25,29,33,36]). The
standard proof technique relies on simulation relations [37] that couple the two
state spaces, which is directly reflected in the CHC system of Def. 1.

Many methods and tools support development using formal refinement
[1,4,8,17,26,29,33,45]. Large-scale verification projects that are based on refinement
include seL4 [30], FSCQ [10], Flashix [48], and CompCert [35], with high human
effort involved. Correct-by-construction correspondence between low-level code
and high-level data types helps to some extent in, e.g., [13] and COGENT [3].
Recent work on “push-button” verification includes a verified TLS library [12], the AWS
C Common library [11], a file system [50], a hyperkernel [42], and network functions [56],
where the high degree of proof automation is in part achieved by statically
bounding the state space of the systems. The latter work [56] specifically notes
how non-experts can formulate high-level correctness requirements (their
specifications are written in Python), as evidence that refinement-based approaches
may ultimately overcome the “specification bottleneck” [6,43].


8 Conclusion and Outlook 


We have demonstrated an approach that can fully automatically synthesize and
prove relational invariants over recursive data types and arrays. The approach
is based on introducing quantifiers and recursion into the definition of such
relations in a systematic way, and on instantiating this schema with constraints
from joint transitions of the two systems. A somewhat surprising insight was
that it is useful to view such transitions both forward and in reverse, leading to
the classification into producers and consumers as a guideline for the search.

We have presented a general synthesis algorithm and two concrete
instantiations for data structures of different sorts. The approach is fully
automatic in guessing a relation and proving it correct. It relies on the recently
developed CHC solver called ADTCHC, which in turn is based on the SMT-based
theorem prover ADTIND featuring support for arrays, quantifiers, and structural
induction. The approach is modular and can be extended by further synthesis
strategies in the future. In particular, since it is based on CHC techniques, it can be
integrated with other existing CHC solvers tailored to non-ADT reasoning, and
can be used in large-scale verification frameworks such as [24] that reduce
safety verification to CHC tasks.

Many more interesting benchmarks lend themselves to further investigation:
positional insertion and removal of lists, amortized data structures, benchmarks 
based on trees or nested arrays, and ultimately some real-world software systems. 
With a growing search space, it becomes more important to quickly recognize in- 
correct simulation relations, e.g., by evaluation-based counter-examples (cf. [31]), 
to prevent costly proof attempts. Similarly, incorporating external tools for in- 
variant generation is another topic for future work. 
Abstract. Tools that automatically prove the absence or detect the
presence of large floating-point roundoff errors or the special values NaN 
and Infinity greatly help developers to reason about the unintuitive nature 
of floating-point arithmetic. We show that state-of-the-art tools, however, 
support or provide non-trivial results only for relatively short programs. 
We propose a framework for combining different static and dynamic 
analyses that allows us to extend their reach beyond what they can do
individually. Furthermore, we show how adaptations of existing dynamic 
and static techniques effectively trade some soundness guarantees for 
increased scalability, providing conditional verification of floating-point 
kernels in realistic programs. 


1 Introduction 


Floating-point arithmetic is widely used across many domains, including machine 
learning, scientific computing, embedded systems, and the Internet of Things. 
Floating-point computations resemble real-valued arithmetic, but provide only 
finite precision, which commits roundoff errors at potentially every operation. 
While these errors are individually small, they propagate through an application 
and can make its results meaningless [47]. In addition, floating-point arithmetic 
features special values such as not-a-number (NaN) and Infinity [48]. As a result, 
these computations are very challenging for developers to reason about and 
debug manually. There is, therefore, a clear need for automated verification and 
debugging techniques for such computations. 

Unfortunately, today’s techniques do not handle realistic floating-point pro- 
grams well. Consider for example a program that simulates the interaction of 
several bodies under gravity. We took a C implementation of this N-body problem 
from Rosetta Code [5], which takes as input the masses, positions and velocities 
of (in our case) three bodies, and shows their evolution over a number of time
steps. The entire program is moderately sized with 108 lines of code. Suppose
that we want to verify the absence or presence of special floating-point values and
cancellation (i.e. large roundoff) errors in this program. None of the currently 
available floating-point analysis tools is able to do this. 
 1  int main(int argc, char* argv[]) { ... // Reads masses, positions and velocities
 2    for(int i=0; i<timeSteps; i++) { simulate(mass, pos, v); ...}
 3  }
 4  void simulate() { compute_accelerations(mass, pos); ...}
 5  void compute_accelerations(double mass[], vector pos[]){
 6    for(int i=0;i<bodies;i++){ ...
 7      for(int j=0;j<bodies;j++) {if(i!=j) {
 8        acc[i] = numerical_kernel(mass[j], pos[i], pos[j], acc[i]);}}}}
 9  vector numerical_kernel(double mass, vector pos_i, vector pos_j, vector acc) {
10    return addVectors(acc, scaleVector(g*mass/pow(mod(subtractVectors(pos_i,pos_j)),3),
          subtractVectors(pos_j,pos_i))); // compute acceleration
11  }


Listing 1.1. Snippet of Rosetta Code N-body simulation


State-of-the-art static roundoff-error analysis tools [33,31,30,60,65,72] are in
principle capable of proving the absence of both special values and large roundoff 
errors by computing an abstraction of the possible behaviors. However, they work 
only on small programs, mostly consisting of a single function, and thus do not 
work for our N-body example. The static tools that do scale [11,63,43] suffer 
from large over-approximations due to abstractions and thus effectively cannot 
prove the absence of issues either. Bounded model checking [52] or SMT decision 
procedures [25] perform exact bit-precise reasoning, but do not scale enough due 
to the complexity of floating-point arithmetic. 


On the other hand, there exist dynamic analyses that search for concrete inputs 
proving the presence of Infinities [38], NaNs or cancellation errors [10,21,78]. We 
could not apply any of these tools to our example, in large part because they, too,
have been designed for relatively small programs. More guided techniques such as 
symbolic execution [57] rely on a back-end SMT solver, for which floating-point 
theories have very limited scalability. 


We evaluated representative available tools on a new collection of floating-point
benchmarks and obtained similar results for most of them (Section 5).


We observed that often only a relatively small part of a program performs 
complex numerical computations—we call these parts the numerical kernels. 
Existing state-of-the-art floating-point analyzers can be applied to these kernels, 
provided that one can supply a precondition that bounds the kernel’s input ranges 
(their minimum and maximum values). Obtaining such preconditions manually is 
challenging, since the kernels are usually functions called from within nested loops. Listing 1.1
shows a subset of the N-body example; the numerical kernel that we identified is 
on line 9, nested behind several for-loops and function calls. 


Based on this observation, we propose a two-phase analysis that combines 
different program analyses to conditionally verify the absence of special values 
and cancellation errors in numerical kernels ‘concealed’ in large programs. First, 
we employ a scalable program analysis to infer the ranges of a kernel’s inputs in 
the context of the containing application. In the second phase a different program
analysis assumes these ranges to verify the kernels. 

The main insight behind this combination is that the first scalable analysis 
does not need to perform sophisticated floating-point reasoning; the domain 
specifications required for the second numerical analysis need to only capture 
input ranges of variables. 

The main challenge in our two-phase analysis is the first phase where our 
objective is to infer the ranges of the kernel inputs automatically. We first 
attempt to verify the numerical kernels fully soundly. Hence, we utilize abstract 
interpretation to infer sound ranges of kernel inputs. In case it is unable to infer 
useful (finite) ranges for the kernels, we propose to adapt existing blackbox and 
greybox fuzzing techniques [12], and evaluate them in their ability to produce 
large kernel input ranges capturing as many feasible inputs as possible. 

After inferring the kernel ranges, the second phase utilizes a slightly adapted 
existing static and sound roundoff error analysis [30] to verify the kernels. In 
case this analysis produces warnings for special values, we additionally utilize 
SMT-based bounded model-checking [52] to check for spurious warnings. 

Although there is a large body of work on combining different program 
analyses, our goal of analyzing real-world applications to verify their numerical 
kernels is novel. Our combination is specifically tailored to this setting, by 
considering the intricacies of floating-point arithmetic and the limitations of 
today’s analysis techniques in reasoning about them. 

Using a dynamic analysis in the first phase means that we are only able to 
infer approximations of the kernel input ranges. Consequently, we can verify 
the kernels only conditionally, because the verification is performed under the 
assumption that the input-domain specifications precisely describe possible values 
of the kernel inputs. Thus, we take a practical standpoint and relax the soundness 
guarantees in favor of wider applicability of today’s static floating-point roundoff- 
error verification techniques. 

Our evaluation shows that for 16 out of 24 kernels, this approach is able to 
verify that no special floating-point values occur; for 3 of those kernels, verification 
is sound. For 14 kernels, we additionally show the absence of cancellation errors 
that are a main cause of large roundoff errors. 


Contributions To summarize, our paper makes the following contributions: 


a) a two-phase framework that combines dynamic and static analyses to condi- 
tionally verify the absence of floating-point special values and large roundoff 
errors in kernels, 

b) a novel guided blackbox fuzzing technique to infer kernel ranges, implemented 
in an open-source prototype tool called Blossom, and 

c) an evaluation on a new benchmark set of mid-size numerical programs. 


Our benchmarks, the tool Blossom as well as scripts of all of our experiments are 
available at https://github.com/dlohar/blossom. 
Fig. 1. Overview of our approach


2 A Two-Phase Approach 


Figure 1 shows an overview of our two-phase approach that strives to increase 
the reach of existing analyses of floating-point numerical kernels.
Our key observation is that such kernels appear in real-world applications from a 
variety of domains, but they are often ‘hidden’ behind several function calls and 
other non-numerical code that the round-off analyzers cannot handle. The first 
phase infers bounds on the input variables of a set of numerical kernels K that 
have been identified by a user in a program P. In the second phase, we utilize 
these ranges to (conditionally) verify the kernels, i.e. to (conditionally) prove the 
absence of special values and large roundoff errors. 

An alternative strategy would be to identify the largest kernel input ranges 
for which correctness can be guaranteed. However, even if one could infer such 
preconditions (we are not aware of a tool that performs such a backward analysis), 
our techniques for the first phase would still be needed to determine whether the 
program can execute the kernels on inputs outside of the safe ranges. 


2.1 First Phase: Whole Program Analysis 


In the first phase we have a whole program analyzer that, starting from the 
program inputs constrained by Z, infers bounds R on the kernel inputs auto- 
matically. These bounds are crucial, as the presence of cancellations and special 
values directly depends on the ranges of possible values; an unbounded input 
range will, in general, also lead to unbounded roundoff errors and special values. 

To obtain the kernel ranges, we need to analyze the entire program. In 
general, it is infeasible to compute the exact ranges, so we approximate
them. We propose to first use a sound static analysis, which computes an
over-approximation of the true ranges. These ranges thus cover all feasible inputs, but
additionally also spurious ones, so we want them to be as tight (small) as
possible. If the abstractions necessarily performed by the static analyzer become 
prohibitively large, we propose to use dynamic analysis to compute an unsound 
approximation of the kernel ranges. These ranges should be as wide as possible 
to capture as many concrete executions as possible. 
Sound Static Analysis We choose abstract interpretation [26] and specifically the
industry-strength analyzer Astrée [63] to infer a sound over-approximation of the 
kernel ranges, as Astrée scales for large programs with complex code and data 
structures and comes with a variety of abstract domains. 

The choice of the abstract domain in Astrée is, in general, a trade-off between 
the amount of over-approximation and the analysis running time. The interval
domain abstracts a set of concrete variable values by their lower and upper
bounds: $[\underline{x}, \overline{x}] := \{x \mid \underline{x} \le x \le \overline{x}\}$. While operations on interval arithmetic [64]
are efficient, intervals cannot capture correlations between variables and therefore
over-approximate the real behavior (e.g. $x - x \neq 0$ in interval arithmetic).
Nonetheless, for our benchmarks we have not observed any noticeable difference 
in the results with more sophisticated domains (e.g. octagon). This is likely due 
to our benchmarks having many nonlinear operations. Hence, we choose the 
interval domain as the numerical abstract domain for our purpose. 
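
The loss of correlation can be illustrated with a minimal C sketch of interval subtraction. The type `interval` and the helper names are illustrative assumptions and not part of Astrée; outward rounding of the bounds, which a sound analyzer needs, is ignored here for brevity.

```c
/* Closed interval [lo, hi]. Rounding of the bounds is ignored in this sketch. */
typedef struct { double lo, hi; } interval;

/* Interval subtraction: [a.lo - b.hi, a.hi - b.lo]. */
static interval interval_sub(interval a, interval b) {
    interval r = { a.lo - b.hi, a.hi - b.lo };
    return r;
}

/* Width of an interval, i.e. upper bound minus lower bound. */
static double interval_width(interval a) {
    return a.hi - a.lo;
}
```

For x in [1, 2], `interval_sub(x, x)` yields [-1, 1] rather than [0, 0], because the domain does not record that both operands are the same variable.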


Dynamic Analysis Fuzzing finds inputs that demonstrate certain (unwanted) 
behavior. We propose to fuzz a program and at the same time monitor the kernel 
inputs to record the lower and upper bounds seen during concrete executions. 

We instrument each user-specified kernel in the program with a kernel monitor 
that keeps track of the smallest and largest value seen for each kernel input. 
We repeatedly execute the instrumented program and report the minimum and 
maximum values seen for each kernel input over all executions. This approach 
crucially depends on the choice of program inputs that are used for fuzzing. We 
propose and experimentally compare blackbox, guided blackbox, and directed 
greybox fuzzing [12] as methods for input selection in Section 6. 
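
A kernel monitor of this kind can be sketched in a few lines of C. This is only an illustration under assumptions: the kernel with two inputs, the array names and `monitor_kernel_inputs` are hypothetical, and Blossom inserts its monitor via LLVM instrumentation rather than by hand.

```c
#include <math.h>   /* INFINITY */

#define NUM_KERNEL_INPUTS 2   /* hypothetical kernel with two inputs */

/* Smallest and largest value observed so far for each kernel input. */
static double range_lo[NUM_KERNEL_INPUTS] = { INFINITY, INFINITY };
static double range_hi[NUM_KERNEL_INPUTS] = { -INFINITY, -INFINITY };

/* Called at kernel entry; returns 1 if any recorded range was widened. */
static int monitor_kernel_inputs(const double in[NUM_KERNEL_INPUTS]) {
    int widened = 0;
    for (int i = 0; i < NUM_KERNEL_INPUTS; i++) {
        if (in[i] < range_lo[i]) { range_lo[i] = in[i]; widened = 1; }
        if (in[i] > range_hi[i]) { range_hi[i] = in[i]; widened = 1; }
    }
    return widened;
}

/* Hypothetical numerical kernel, instrumented with the monitor at its entry. */
static double kernel(double x, double y) {
    const double in[NUM_KERNEL_INPUTS] = { x, y };
    monitor_kernel_inputs(in);
    return x * y + x;
}
```

After all fuzzing runs, `range_lo` and `range_hi` hold the reported kernel input ranges. The return value of the monitor is what a guided input generator would use as feedback.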

Blackbox fuzzing is a naive but effective technique in many testing situations. 
In our setting, the blackbox fuzzer randomly draws inputs from the program 
ranges Z, i.e. without any reference to the internal structure of the program. 

We further propose guided blackbox fuzzing that is guided toward enlarging 
the kernel input ranges. For this, the program input generator records those 
inputs that have widened the kernel ranges, and randomly generates new inputs 
that are within a certain (small) distance from these, in the hope that the new 
inputs would enlarge the monitored ranges even further. 

While blackbox techniques are straightforward to implement, they do not take 
into account the program structure. We thus evaluate an adaptation of directed 
greybox fuzzing, implemented in the state-of-the-art tool AFLGo [12], which can
be directed toward specific program locations, while exploring as many different 
paths in the program as possible. We first fuzz the program to obtain an initial 
estimate for the kernel input ranges with AFLGo (targeting the kernel). Then, 
we employ AFLGo in a refinement loop that iteratively attempts to widen the 
currently seen kernel input ranges. We instrument the kernels with conditional 
statements that check whether a kernel input is outside of the current kernel 
range. We use this conditional statement as a target for AFLGo, effectively 
directing it to find kernel inputs that are outside of the current estimate. If 
AFLGo finds a program input that widens the current kernel input range, we 
update it accordingly and iterate the process until a user-defined timeout. 
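
The instrumentation used as a fuzzing target can be pictured as follows. This is a hedged sketch: the function `range_widening_target`, the initial bounds and the in-place update are assumptions made for illustration, and no AFLGo interface is shown (the target location would be passed to the fuzzer separately).

```c
/* Current estimate of the range of one kernel input (hypothetical values,
   standing in for the result of an earlier fuzzing round). */
static double cur_lo = -1.0, cur_hi = 1.0;
static int target_hits = 0;

/* Marker function whose call site would serve as the directed fuzzer's target. */
static void range_widening_target(void) { target_hits++; }

/* Hypothetical kernel with the range check inserted at its entry. */
static double kernel(double x) {
    if (x < cur_lo || x > cur_hi) {
        range_widening_target();     /* reached only by inputs outside the estimate */
        if (x < cur_lo) cur_lo = x;  /* simplification: widen the estimate in place */
        if (x > cur_hi) cur_hi = x;
    }
    return x * x;
}
```

Any program input that drives execution to the marked call site witnesses a kernel input outside the current estimate, so reaching the target and widening the range coincide.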
2.2 Second Phase: Numerical Kernel Analysis


With the ranges (R) inferred in the first phase, we analyze the user-identified 
numerical kernels (K) in the second phase with a static analyzer. Our objective
in the second phase is to either show the absence of special floating-point values 
and large roundoff errors in a kernel or to generate warnings for the potential 
presence of such values. 

We use the sound floating-point roundoff analysis tool Daisy [30], which 
automatically proves the absence of special values and computes an absolute 
error bound for each kernel output. When Daisy generates a warning that special 
values can potentially occur, we use a SAT/SMT-based model checker that 
performs exact floating-point reasoning and that can identify spurious warnings. 

By itself, the error bound on the kernel output is not particularly helpful, 
however, since we do not know how this error propagates to the end of the 
program (although there exist scalable analyses that potentially can compute 
this information, e.g. [61]). That said, for many numerical applications the exact 
error bound is not important, since the algorithm itself is already approximate. 
For these applications, it is thus sufficient if we can show that the roundoff 
errors are not too large. We thus modify Daisy to report a warning when it 
detects a possible cancellation, i.e. when an arithmetic operation increases the 
relative error significantly (e.g. when two values that are close in magnitude get 
subtracted [42]). Additionally, Daisy includes an optimization procedure that can 
improve the accuracy of the kernels by rewriting the arithmetic expressions to 
commit smaller roundoff errors. We provide more details in Section 4. 


2.3 Soundness Guarantees 


To summarize, using the extended Daisy analysis, we can conditionally verify 
that kernels do not result in any NaN or Infinity, and that they do not commit 
cancellation errors, i.e. lead to large roundoff errors. When the kernel input ranges 
are computed soundly using abstract interpretation (e.g. Astrée), our verification 
is conditional in that we only verify the absence of cancellations for the kernels, 
but not for the rest of the program. 

When the ranges are computed using dynamic analysis in the first phase, 
they include more concrete values than the fuzzer witnessed. Values between the 
lower and upper bound are not necessarily observed by the fuzzer, and are also 
not necessarily feasible. If one were to consider only values witnessed at runtime, 
then it would be possible to analyze kernels for individual traces, although this 
would be quite expensive [10]. However, if we can soundly show that no special 
values or large roundoff errors (cancellations) occur inside a kernel for a given 
input range, we have shown this for more executions than can be explored by 
dynamic testing in general (since there are usually too many floating-point values 
to explore exhaustively). Unlike for a NaN or Infinity that are obvious to detect, 
cancellation cannot, in general, be detected by inspecting the computed results 
and thus our combination is valuable. 
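
A small C example makes the last point concrete. It is a sketch assuming IEEE 754 double arithmetic with round-to-nearest; the function names are illustrative.

```c
/* Computes (1 + x) - 1, which equals x over the reals. */
static double one_plus_x_minus_one(double x) {
    double s = 1.0 + x;   /* rounded to the nearest double close to 1 */
    return s - 1.0;       /* the subtraction exposes the earlier rounding */
}

/* Relative deviation of `computed` from `reference`. */
static double relative_error(double computed, double reference) {
    double d = (computed - reference) / reference;
    return d < 0 ? -d : d;
}
```

For x = 1e-15 the computed result is finite and looks plausible, yet it deviates from x by roughly 11%. Nothing in the output itself signals the cancellation, whereas a NaN or Infinity would be immediately visible.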
3 First Phase: Whole Program Analysis


Abstract Interpretation with Astrée We utilize Astrée as it scales for large C 
programs with complex code and data structures. We add wrapper functions 
to provide bounds for global variables, since Astrée does not assume ranges 
for global variables directly. We further annotate the kernels K with Astrée’s 
__ASTREE_log_vars() construct. This construct records the range information that 
Astrée logs about the kernel inputs at the entry of the kernels. 

Note that the analysis of Astrée can be extensively parameterized with the 
knowledge of the program under analysis. Although this makes the analysis even 
more precise, it requires vast manual effort and knowledge of the intricacies of 
the program. To avoid this, we parameterize Astrée as generically as possible. 
We only use semantic loop unrolling until a defined loop bound to reduce the 
over-approximation in the analysis for all benchmarks. 


Blackbox Fuzzing with Blossom We implement our novel blackbox fuzzing for ker- 
nel range computation in a tool we call Blossom. Blossom works by instrumenting 
the program to be analyzed. Blossom is implemented as an LLVM pass and works 
on C, C++, and Rust input programs with complex programming constructs 
and data types (and would work for any programming language that compiles to 
LLVM). Blossom takes as input the program P, a configuration file that specifies 
the ranges of program inputs, the fuzzing technique that we want to execute 
(standard or guided blackbox), and a timeout. The LLVM pass automatically 
instruments P by inserting code that performs the indicated fuzzing process until 
the specified timeout, and records the ranges of kernel inputs. 

In order to perform vanilla blackbox fuzzing, the code is instrumented with an 
input generator that utilizes the srand() function with distinct seeds to randomly
generate values of program inputs from the set of input bounds Z. This process 
is continued until the specified timeout. 


Guided Blackbox Fuzzing with Blossom Algorithm 1 shows our guided blackbox 
fuzzing algorithm for generating program inputs to maximize kernel ranges. The 
algorithm is also implemented via LLVM-pass instrumentation in Blossom. 
The inputs to Algorithm 1 are the program P with an identified set of kernels 
K, a set of n program input ranges (Z), and a timeout (T). The algorithm is also 
parameterized by the number of mutations m and a constant c that determines 
the neighborhood radii for all program inputs from which mutants (new program 
inputs) are drawn. The algorithm returns a set of kernel ranges $[\{R_{lo}\}, \{R_{hi}\}]$
(line 16). The goal is to compute the intervals $[\{R_{lo}\}, \{R_{hi}\}]$ as wide as possible.
The algorithm keeps an input queue Q, which stores program inputs on which
the program is to be executed. If Q is empty, m new random inputs taken from
the program input ranges Z are added to it (lines 6-7). If Q is not empty, the
algorithm first dequeues one valuation of all the program inputs $\{v_1, \dots, v_n\}$
from Q (line 9), and executes the program P on these program inputs. During 
the execution of the program, the kernel monitor checks the kernel inputs and 
updates the kernel ranges as it is done in vanilla blackbox fuzzing (line 10). If the 
Algorithm 1 Guided Blackbox Fuzzing

 1: procedure GUIDED-BLACKBOX(P, Z, K, T, m, c)
 2:   Q ← ∅, {R_lo} = {DBL_MAX}, {R_hi} = {DBL_MIN}
 3:   {r_1, ..., r_n} ← computeRadii(Z, c)               ▷ generates mutation radii
 4:   while T ≠ 0 do
 5:     if Q == ∅ then
 6:       for i from 1 to m do
 7:         Q ← enqueue(generateRandomInput(Z))          ▷ generates random inputs
 8:     else
 9:       {v_1, ..., v_n} ← dequeue(Q)
10:       [{R_lo}, {R_hi}] ← executeAndMonitorKernels(K)
11:       if (kernelRangeUpdated([{R_lo}, {R_hi}])) then
12:         for i from 1 to m − 1 do
13:           {d_1, ..., d_n} ← mutate(v_1 ∓ r_1, ..., v_n ∓ r_n)
14:           Q ← enqueue({d_1, ..., d_n})
15:         Q ← enqueue(generateRandomInput(Z))          ▷ avoids local max/min
16:   return [{R_lo}, {R_hi}]                            ▷ returns kernel input ranges


kernel ranges were updated, i.e. we found an input that led to a kernel input
being outside of the currently known range, we generate $m - 1$ mutants from the
program input $\{v_1, \dots, v_n\}$ by randomly drawing inputs from its neighborhood
$v_1 \mp r_1, \dots, v_n \mp r_n$ and add them to the queue (lines 12-14). (We draw mutants
randomly from the neighborhood to reduce the possibility of duplicate program
inputs.) The neighborhood, i.e. the maximal distance of a mutant from the original
program input, is defined by the neighborhood radii $\{r_1, \dots, r_n\}$ (computed
once on line 3), which depend on the width of each input range. Effectively, if an
input range is large, then we will draw mutants from a larger neighborhood as
well. This step lets us search in the neighborhood of the inputs that recently
enlarged the kernel ranges. Then, we generate one random input for all
variables in the whole input range (line 15). This step ensures that we do not
get stuck in a local maximum or minimum. The whole process is repeated until
the timeout T.
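
The loop of Algorithm 1 can be sketched in plain C as follows. This is not Blossom's implementation: the program under test (`run_program`, with a single kernel input x*x - y), the bounded circular queue, the radius formula c times the range width, the clipping of neighborhoods to the input ranges, and the iteration budget that replaces the timeout T are all assumptions made to keep the sketch self-contained. The ranges are initialized to +/- infinity here.

```c
#include <math.h>     /* INFINITY */
#include <stdlib.h>   /* rand, RAND_MAX, NULL */

#define N_INPUTS  2    /* program inputs of the hypothetical program */
#define QUEUE_CAP 64   /* capacity of the input queue Q; assumes m <= QUEUE_CAP */

/* Monitored range of the single kernel input. */
static double k_lo = INFINITY, k_hi = -INFINITY;

/* Hypothetical program: runs on inputs v and feeds x*x - y to its kernel.
   Returns 1 if the monitored kernel range was widened. */
static int run_program(const double v[N_INPUTS]) {
    double k = v[0] * v[0] - v[1];
    int updated = 0;
    if (k < k_lo) { k_lo = k; updated = 1; }
    if (k > k_hi) { k_hi = k; updated = 1; }
    return updated;
}

/* Random value in [lo, hi], clamped to guard against rounding. */
static double uniform(double lo, double hi) {
    double v = lo + (hi - lo) * ((double)rand() / (double)RAND_MAX);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Appends one input to the circular queue. With radius == NULL the input is
   fully random; otherwise it is drawn from the neighborhood of `center`. */
static void enqueue(double q[][N_INPUTS], int head, int *size,
                    const double lo[], const double hi[],
                    const double center[], const double radius[]) {
    if (*size == QUEUE_CAP) return;                 /* drop input when Q is full */
    double *slot = q[(head + *size) % QUEUE_CAP];
    for (int i = 0; i < N_INPUTS; i++) {
        double a = lo[i], b = hi[i];
        if (radius) {                               /* clip neighborhood to Z */
            if (center[i] - radius[i] > a) a = center[i] - radius[i];
            if (center[i] + radius[i] < b) b = center[i] + radius[i];
        }
        slot[i] = uniform(a, b);
    }
    (*size)++;
}

static void guided_blackbox(const double lo[], const double hi[],
                            long budget, int m, double c) {
    double q[QUEUE_CAP][N_INPUTS], radius[N_INPUTS], v[N_INPUTS];
    int head = 0, size = 0;
    for (int i = 0; i < N_INPUTS; i++)
        radius[i] = c * (hi[i] - lo[i]);            /* line 3: radii from widths */
    while (budget-- > 0) {                          /* line 4, budget for timeout */
        if (size == 0) {                            /* lines 5-7 */
            for (int j = 0; j < m; j++)
                enqueue(q, head, &size, lo, hi, NULL, NULL);
            continue;
        }
        for (int i = 0; i < N_INPUTS; i++) v[i] = q[head][i];   /* line 9 */
        head = (head + 1) % QUEUE_CAP;
        size--;
        if (run_program(v)) {                       /* lines 10-11 */
            for (int j = 0; j < m - 1; j++)         /* lines 12-14 */
                enqueue(q, head, &size, lo, hi, v, radius);
            enqueue(q, head, &size, lo, hi, NULL, NULL);        /* line 15 */
        }
    }
}
```

Calling `guided_blackbox(lo, hi, 20000, 5, 0.05)` with input ranges x in [-2, 2] and y in [0, 1] leaves an under-approximation of the true kernel range [-1, 4] in `k_lo` and `k_hi`. The observed range can never exceed the true one, which is why the resulting verification is only conditional.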


4 Second Phase: Static Analysis with Daisy and CBMC 


Next, we use the computed kernel ranges R as kernel input specifications (pre- 
conditions) and adapt the state-of-the-art roundoff-error analyzer Daisy [30] to 
verify the absence of cancellation errors and special float values. The translation 
of kernels and the precondition annotation to Daisy’s input language in Scala is 
currently done manually, but could be automated in the future. 

Daisy’s core roundoff-error analysis performs a forward dataflow analysis. It 
computes ranges and worst-case absolute error bounds for each intermediate arith- 
metic (abstract syntax tree) expression using the interval and affine arithmetic 
abstract domains. As part of this analysis, it checks for overflows and invalid 
expressions that could lead to NaN values, as their absence is a prerequisite for a
meaningful roundoff-error computation. 

We extend Daisy to check at every intermediate expression for a possible 
cancellation, using the ranges and absolute error bounds that Daisy computes 
by default. At each binary arithmetic operation, we compare the relative errors 
of the operands with the relative error of the binary operation result. If the
relative error increases by more than a given factor, we report an error. We compute
the relative error for an intermediate expression x as its worst-case
absolute error bound divided by the smallest value that the range of x contains.
When the range of x ($[x]$) contains zero, we divide instead by some small constant
$\epsilon$ to make relative errors always well-defined. While this does not
compute a sound bound on the relative error, this is not needed for our purpose, 
since we are only interested in a relative comparison. 
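
The check can be sketched in C as follows. This is an illustration under assumptions, not Daisy's code (Daisy is written in Scala): the struct `expr_info`, the value of `EPS_SMALL`, the reading of "smallest value" as smallest magnitude, and the comparison against the larger of the two operand errors are all choices made for this sketch.

```c
/* Range and worst-case absolute error bound of an intermediate expression. */
typedef struct { double lo, hi, abs_err; } expr_info;

#define EPS_SMALL 1e-300   /* hypothetical stand-in for the small constant */

static double abs_d(double v) { return v < 0 ? -v : v; }

/* Relative-error estimate: absolute error over the smallest magnitude in the
   range, or over EPS_SMALL when the range contains zero. */
static double rel_err(expr_info e) {
    double denom;
    if (e.lo <= 0.0 && e.hi >= 0.0)
        denom = EPS_SMALL;
    else
        denom = abs_d(e.lo) < abs_d(e.hi) ? abs_d(e.lo) : abs_d(e.hi);
    return e.abs_err / denom;
}

/* Returns 1 if the result's relative error exceeds `factor` times the larger
   relative error of the two operands. */
static int possible_cancellation(expr_info lhs, expr_info rhs,
                                 expr_info res, double factor) {
    double l = rel_err(lhs), r = rel_err(rhs);
    double operands = l > r ? l : r;
    return rel_err(res) > factor * operands;
}
```

Subtracting operands with ranges [2.001, 2.002] and [2.0, 2.0005] gives a result range [0.0005, 0.002]. The absolute errors add up while the magnitude shrinks by three to four orders, so the estimate grows by a similar factor and the operation is flagged.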

With this extension, we can prove for each kernel and the specified kernel 
input ranges, that cancellation and special values do not occur (but we cannot 
prove their presence). When Daisy cannot show this, it issues a warning with 
the possibly problematic intermediate expression. Spurious warnings for special 
values can be checked with a tool that performs exact reasoning, e.g. CBMC [52], 
and which reports a counterexample trace to the user who can use this trace to 
confirm whether the warning is genuine and if so, for debugging. 


Optimizing the Kernels Daisy furthermore provides a rewriting optimization that 
finds an ordering of an arithmetic expression for which it can show a smaller 
(absolute) roundoff error [32]. The rewriting relies on the fact that floating-point 
arithmetic is neither associative nor distributive, and hence different evaluation
orders commit errors of different magnitudes. Daisy’s algorithm uses real-valued 
identities such as associativity and distributivity to rewrite the expression. Using 
this optimization, we can thus locally improve the accuracy of the numerical 
kernels. 
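
The effect of evaluation order is easy to reproduce. The following C sketch, assuming IEEE 754 double arithmetic with round-to-nearest, evaluates the same real-valued sum in two orders; the function names are illustrative and the example is not taken from Daisy.

```c
/* Two evaluation orders of the same real-valued sum a + b + c. */
static double sum_left(double a, double b, double c)  { return (a + b) + c; }
static double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1e16, b = -1e16 and c = 1.0, `sum_left` returns 1.0, whereas `sum_right` returns 0.0 because c is absorbed when it is added to b first. A rewriting optimization searches for the order with the smaller error bound.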


5 State of the Art on Real-World Programs


We collected a new set of real-world numerical programs from different application 
domains, as existing floating-point benchmark sets [29] cover kernels only. We 
first report on our experiments using existing representative state-of-the-art tools 
on these benchmarks, before evaluating our approach in Section 6. 


Benchmarks All our benchmark programs are existing programs collected online 
from a variety of domains such as scientific computing simulations (nbody, pendulum, 
lulesh, reactor, molecular), physics algorithms (fbench, arclength), numerical methods 
(linpack) and machine learning (linearsvc). Table 1 provides an overview of the
size and complexity of our benchmarks, as well as the number and arithmetic 
complexity of the kernels that we chose for verification. We also count the number 
of trigonometric operations (implemented in library functions) in the kernels, 
and the ‘depth’ column shows the number of function calls needed to reach the 
kernels from program entry. 
                                                      kernels
benchmark        lang.  LOC   #in.  #func.  #loops    #   #arith.op.        #trig.op.    depth
arclength [68]   C      31    1     1       2         1   20                5            1
linearsvc [8]    C      32    4     1       3         1   7                 -            1
raycasting [6]   C      94    ?     ?       ?         1   ?                 -            4
nbody [5]        C      108   21    10      9         2   9, 22             -, -         2, 2
pendulum [2]     C      ?     ?     ?       8         2   24, 42            ?            4, 2
fbenchv2 [1]     C      215   8     2       5         2   6, 14             -, 5         2, 2
molecular [4]    C++    323   3     8       13        3   ?                 -, -, 3      ?
fbenchv1 [1]     C      380   8     10      8         4   19, 6, 14, 36     -, -, 5, -   5, 2, 2, 3
reactor [7]      C++    467   4     11      2         3   ?                 -, 2, 2      ?
linpack [3]      C      544   5     12      31        1   8                 -            2
lulesh [51]      C++    2187  5     43      74        4   109, 77, 14, 41   -, -, -, -   6, 7, 6, 7

Table 1. Benchmark statistics 


These benchmarks are single-threaded C or C++ floating-point programs 
with arrays, structures, branching, loops, and function calls (we translated the 
pendulum benchmark manually from Python to C). We modified the benchmarks by
replacing dynamic memory allocation, pointer arithmetic, and I/O operations as 
appropriate, since these are challenging for most program analyses. We considered 
two versions of fbench: one with user-defined trigonometric functions (V1) and 
380 LOC, and another with their library versions (V2). We specified bounds on 
the program inputs manually and identified a set of numerical kernels containing 
a large number of arithmetic operations. 


State of the Art We first evaluate existing state-of-the-art tools on our benchmark 
set. For this, we choose CBMC, Astrée and AFLGo as representatives for model 
checking, abstract interpretation and directed greybox fuzzing, respectively. To 
the best of our knowledge, AFLGo was not used for floating-point debugging 
before. These tools check for assertion violations, so we have added assertions to 
our chosen kernels to check for absence of Infinity and NaN using the standard 
library functions isinf and isnan. 
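
Such an assertion-instrumented kernel might look like the following C sketch. The kernel body and the helper `is_special` are hypothetical; only `isnan` and `isinf` from `<math.h>` are the standard classification macros mentioned above.

```c
#include <assert.h>
#include <math.h>   /* isnan, isinf */

/* Returns 1 if v is NaN or +/-Infinity. */
static int is_special(double v) {
    return isnan(v) || isinf(v);
}

/* Hypothetical kernel: the division yields Infinity or NaN when a == b. */
static double kernel(double a, double b) {
    double r = a / (a - b);
    assert(!is_special(r));   /* property checked by the analysis tools */
    return r;
}
```

A model checker or abstract interpreter then tries to prove that the assertion can never fail, while a fuzzer searches for an input such as a == b that violates it.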

We do not include a deductive verifier (e.g. [24]) in this comparison, because 
it requires detailed user annotations of every function. None of the state-of- 
the-art static roundoff-error analysis tools [43,33,31,30,60,65,72] work on the 
whole applications in our benchmark set. Available dynamic analyses for finding 
large roundoff errors [10,21,77,21,78,44] or special values [38,57,9] also work only 
on smaller programs (often restricted to kernels). Only the dynamic-analysis 
tool FPDebug [10] has been shown to scale beyond numerical kernels, but 
unfortunately the code has not been actively maintained over the years. 

All experiments are done for 64-bit precision and on a Debian server system 
with 2.67GHz and 50GB RAM. We have used CBMC version 5.12 with MiniSat 
2.2.0 (we have observed in our preliminary experiments that CBMC performs 
better with MiniSat), Astrée’s Linux64_b5162300_release and AFLGo downloaded on
June 9, 2020. We have set a 1-hour time budget for all experiments and unrolled 
all loops for 50 iterations for both CBMC and Astrée. 

With CBMC and Astrée, we are able to prove the absence of special float 
values in linearSvc and rayCasting, two of the smallest benchmarks. Additionally, 
Astrée also proves the absence of special values in kernels 1 and 5 in fbenchv1. 
For all other C benchmarks (Astrée does not work on C++ programs), Astrée 
generates warnings for the potential existence of special values. With AFLGo, 
however, we do not find any special values within the time limit. 

For the nbody and pendulum benchmarks, we originally had larger program input 
ranges. For these, AFLGo was able to show the presence of special values in the 
kernels, suggesting that greybox fuzzing is effective for detecting special values. 
For the subsequent experiments, we have used tighter program input ranges to 
avoid special values. 


6 Evaluation of our Two-Phase Approach 


We next evaluate our two-phase approach. For a fair comparison with the state-of- 
the-art tools, we designate a 1-hour time limit for the entire analysis, allocating 
50 minutes for generating the kernel ranges and 10 minutes for the kernel analysis. 
We have empirically evaluated the effect of the time limit and observed that 
increasing the time does not affect the results of our benchmarks, but a smaller 
time limit led to worse results. 


Computing Kernel Ranges The main step is the computation of the kernel 
ranges. We compare the kernel ranges obtained with blackbox fuzzing (BB), 
guided blackbox fuzzing (GBB) (both implemented in Blossom), AFLGo with 
our iterative widening (AFLGo), and a combination of BB and AFLGo iterative 
widening (BB+AFLGo). We have empirically determined that with 5 mutants 
GBB performs the best for all our benchmarks. For AFLGo, we first fuzz the 
program for 5 minutes and then run our iterative widening that employs the 
fuzzer in a refinement loop to widen the so-obtained ranges (see Section 2.1) for 
the next 45 minutes. For BB+AFLGo, we use Blossom’s blackbox fuzzing for 25 
minutes to generate the initial ranges. On these ranges, we use our range-widening 
technique with AFLGo for the next 25 minutes. 

To compare the obtained kernel ranges, we first compute the width of each 
kernel range ($\overline{x} - \underline{x}$) and show in Table 2 the average width over all kernel inputs 
and over 5 runs with different random seeds. For our dynamic analyses, we want 
to maximize the kernel ranges to cover as many kernel inputs as possible. 
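
The width metric can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function names and the data layout (one list of (lower, upper) pairs per run) are assumptions made for the example.

```python
from statistics import mean

def range_width(lo, hi):
    """Width of a single kernel input range [lo, hi]."""
    return hi - lo

def average_width(runs):
    """Average range width over all kernel inputs and all runs.

    `runs` is a list of runs; each run is a list of (lo, hi) pairs,
    one pair per kernel input (hypothetical data layout).
    """
    return mean(range_width(lo, hi) for run in runs for (lo, hi) in run)

# Two hypothetical runs of a kernel with two inputs each.
example_runs = [
    [(0.0, 2.0), (-1.0, 1.0)],
    [(0.0, 4.0), (-1.0, 1.0)],
]
print(average_width(example_runs))  # widths 2, 2, 4, 2 -> 2.5
```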

We also add the sound over-approximated ranges computed by Astrée, when- 
ever these are available. While Astrée produces a warning inside the arclength 
kernel, it still computes a finite range for the kernel input. 

In 5 out of the 7 kernels where Astrée finds non-trivial ranges, our fuzzing 
techniques also compute ranges that are close to Astrée’s. They are even equal 
in the case of rayCasting. In the other 2 cases, Astrée reports big ranges whereas 
                                   avg range width                          kernel
benchmark   k   i    Astrée     BB        AFLGo     BB+AFLGo  GBB        analysis

arclength   1   1    6.16e+4    3.14      3.14      3.14      3.14       ✓
linearSVC   1   4    3.73       3.73      3.71      3.72      3.73       (✓)
rayCasting  1   5    12.20      12.20     12.20     12.20     12.20      ✓
nbody       1   6    ∞          1.09e+5   6.67e+4   1.21e+5   1.02e+8    ✓
            2   9    ∞          1.25e+4   8.45e+3   1.19e+4   8.91e+6    ✗
pendulum    1   4    ∞          14.80     12.86     14.82     14.56      ✓
            2   5    ∞          22.38     17.61     22.39     ?          ✓
fbenchV2    1   5    24.60      20.46     20.46     20.46     20.46      ✗
            2   5    ∞          21.36     21.36     21.36     21.36      ✗
fbenchV1    1   1    403.00     0.18      0.18      0.18      0.18       ✓
            2   5    20.50      20.46     20.46     20.46     20.46      ✗
            3   5    ∞          21.36     21.36     24.76     21.36      ✗
            4   1    1.57       1.54      1.54      1.54      1.54       ✓
linpack     1   8    ∞          3.60e+6   4.44e+3   3.60e+6   2.11e+269  ✗
molecular   1   4    ✗          9.04      9.04      9.04      9.04       ✗
            2   6    ✗          1.86      1.86      1.86      1.86       ✓
            3   ?    ✗          12.88     12.88     12.88     12.88      ✓
reactor     1   1    ✗          1.00      1.00      1.00      1.00       ✓
            2   6    ✗          1.43e+2   9.35e+1   1.43e+2   1.46e+2    ✗
            3   1    ✗          2.50      2.50      2.50      2.50       ✓
lulesh      1   24   ✗          4.97      4.80      4.97      4.95       (✓)
            2   18   ✗          6.09      5.51      5.50      5.89       ✓
            3   9    ✗          3.48      3.09      3.42      3.25       ✓
            4   12   ✗          5.95      5.49      5.93      ?          ✓


Table 2. Comparison of kernel ranges generated by different techniques and settings 


all fuzzing techniques compute smaller ranges with the same width, suggesting a 
possible large over-approximation of Astrée’s ranges (or the inability of fuzzers 
to discover new kernel inputs within the time limit). 

In the other cases, when Astrée finds unbounded ranges or does not work, we 
observe that for all but 3 kernels, all four fuzzing techniques compute very similar 
range widths. For 3 kernels, however, GBB finds significantly larger ranges, thus 
discovering kernel inputs that the other methods are not able to find. We thus 
conclude that guided blackbox fuzzing appears to be most suitable for computing 
kernel ranges in our benchmarks, as it can discover apparent outliers. 

AFLGo often computes the smallest ranges. Our hypothesis is that because 
AFLGo aims to maximize the number of paths in the program to reach the target 
locations in the kernels, it focuses on generating values to find new paths rather 
than generating values exercising an already found path that may increase the 
width of the kernel ranges. 
benchmark   kernel  #vars   BB       AFLGo    BB+AFLGo  GBB

linearsvc   1       4       -        22       -         -
nbody       1       6       121.05   312.93   144.86    181.26
            2       9       155.31   226.10   127.25    206.20
pendulum    1       4       0.69     51.77    0.57      5.25
            2       5       0.69     44.37    0.54      4.48
fbenchV2    1       5       ?        ?        ?         ?
            2       5       -        0.04     -         -
fbenchV1    1       1       -        0.03     -         -
            2       5       -        -        1799      -
            3       5       -        0.04     8.85      -
linpack     1       8       0.01     100.15   -         114.58
molecular   2       6       0.25     8.0      0.15      0.33
reactor     1       1       -        0.01     -         -
            2       6       2.51     11.32    2.91      2.80
            3       1       -        0.01     -         -
lulesh      1       24      1.67     6.76     1.74      2.50
            2       18      4.28     19.73    15.59     6.96
            3       9       7.14     23.25    10.55     11.97
            4       12      3.91     16.13    3.49      5.88

Table 3. Variation of computed kernel range widths (from the average width) for our 
three fuzzing techniques (in %), ‘-’ denotes no variation 


Effect of Randomness All fuzzing techniques (BB, GBB, AFLGo) rely on ran- 
domness. To evaluate how the computed kernel ranges are affected by it, we 
calculate the variation of the range widths compared to the average range width 
(per variable) over 5 runs. For 7 kernels, we do not detect any variation at all for 
any of the methods; Table 3 shows the variations for the remaining kernels. 
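
One way to compute such a variation is sketched below. The text does not spell out the exact formula, so the definition used here (largest deviation of any run from the mean width, relative to the mean, in percent) is an assumption, as is the function name.

```python
def variation_percent(widths):
    """Largest deviation of any run's width from the mean width, in percent.

    Assumption: variation = max |w - mean| / mean * 100. The paper may use a
    different definition (e.g. standard deviation).
    """
    avg = sum(widths) / len(widths)
    if avg == 0:
        return 0.0  # all widths are zero, so there is no variation
    max_dev = max(abs(w - avg) for w in widths)
    return 100.0 * max_dev / avg

# Hypothetical widths of one kernel input over 5 runs.
example_widths = [10.0, 12.0, 8.0, 10.0, 10.0]
print(variation_percent(example_widths))  # mean 10, max deviation 2 -> 20.0
```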

We observe that all methods have large variations for the benchmarks nbody 
and linpack, i.e. those for which GBB has found very large ranges. This suggests 
that there are a few corner-case inputs that lead to large kernel ranges (which 
only GBB was able to reliably find). Further, we see that AFLGo has a large 
range variation due to randomness for a few additional benchmarks, whereas BB 
and GBB have variations that are relatively small. 


Conditional Kernel Verification We were able to (conditionally) prove the absence 
of special floating-point values for 16 out of the 24 kernels, and (conditionally) 
prove the absence of cancellation errors for 14 of those kernels. We show these 
results in the last column of Table 2: ‘✓’ indicates that Daisy could prove both 
the absence of special values and cancellation in the kernel for the specified kernel 
ranges, ‘(✓)’ indicates that only the absence of special values could be verified, 
and ‘✗’ shows when Daisy reports a special-value warning. For the relatively small 
benchmarks arclength, linearSVC and rayCasting, our verification of the kernels is 
sound, i.e. unconditional, as we used ranges computed by Astrée. 

When Daisy reports a warning, it is not guaranteed that a kernel can actually 
compute a special-value result, because of 1) Daisy’s over-approximation of the 
concrete program semantics, and because 2) the range we compute may contain 
values that are not feasible in the actual program execution. To help developers 
debug warnings reported by the static analyzer, we use CBMC on those kernels. 

CBMC reports counterexamples in all kernels for which Daisy reports warnings. 
Upon code inspection, however, we identified the counterexamples of nbody and 
fbench to be spurious for the particular program inputs we consider. In these 
cases, the true kernel input range was discontinuous, and the counterexamples 
were reported for the infeasible inputs. In particular, in kernel 2 of nbody, a NaN 
could be produced if the two bodies that are simulated collide, which would 
not happen for the initial conditions that we chose. Similarly, the kernels in the 
ray-tracing algorithm of fbench could produce Infinity, if the ray was chosen in a 
very particular way. With the program input ranges we have chosen, this was 
impossible. 

For linpack, the arithmetic overflow reported is indeed genuine, since a division 
by zero can occur before the kernel if the input matrix contains a zero on the 
diagonal, which leads to undefined behavior and the huge range of the kernel 
inputs. Similarly, for molecular and reactor, arithmetic overflow can occur for a 
specific position of molecules and a specific value of the angle between a particle’s 
direction and the X-axis, respectively. 

We note that given the counterexamples produced by CBMC, we could 
straightforwardly identify the warnings as spurious or genuine. In future work, 
one could consider refining the kernel monitoring, such that it would not only 
track a single range per kernel but could detect discontinuous ranges. 

Our extension of Daisy reports cancellation-error warnings for one kernel of 
linearsvc and one kernel of lulesh. We have used a threshold of $10^3$ for reporting 
cancellation, i.e. if the relative errors of the operands and the result differ by more 
than three orders of magnitude, we report an error. We inspected the kernel code 
and confirmed that the cancellation warnings are genuine, i.e. there are indeed 
inputs that will result in a large roundoff error. The number of cancellations 
found may seem small. We suspect that this is the case, because our benchmarks 
were mostly written as reference or example programs (e.g. lulesh was developed 
to be a representative hydrodynamics simulation code), hence we expect them to 
be carefully developed and tested. 
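
The threshold criterion can be illustrated on concrete values. Note that Daisy performs a static analysis over input ranges; the sketch below is instead a dynamic check on a single pair of inputs, using exact rational arithmetic as the reference. The function names are hypothetical, and the inputs are assumed to be nonzero with a nonzero exact difference.

```python
from fractions import Fraction

THRESHOLD = 10**3  # three orders of magnitude, as in the text

def rel_err(approx, exact):
    """Relative error of a float `approx` w.r.t. an exact rational value."""
    return abs(Fraction(approx) - exact) / abs(exact)

def is_cancellation(a, b):
    """Check whether computing a - b in double precision cancels badly.

    `a` and `b` are exact rational inputs (Fraction). They are rounded to
    floats, subtracted, and the relative error of the result is compared
    against the larger relative error of the two operands.
    """
    fa, fb = float(a), float(b)
    result = fa - fb
    operand_err = max(rel_err(fa, a), rel_err(fb, b))
    result_err = rel_err(result, a - b)
    return result_err > THRESHOLD * operand_err

# Nearly equal operands: the tiny rounding error of `a` dominates the result.
close_a = Fraction(1) + Fraction(3, 2**54)
close_b = Fraction(1)
print(is_cancellation(close_a, close_b))

# Operands of opposite sign: no cancellation occurs.
print(is_cancellation(Fraction(1, 3), Fraction(-1, 3)))
```

In the first call, the operand is rounded with a relative error of about 2^-54, while the computed difference is off by one third of the exact difference, far beyond the threshold.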


Kernel Optimization We have additionally applied Daisy’s rewriting optimization 
on those kernels for which Daisy does not report possible special values. With 
this procedure, we could reduce the roundoff errors in 8 of the kernels out of 
which 6 cases are notable. We could reduce the error by 9.5% for linearsvc, 7.1% 
and 3.3% for two outputs of kernel 2 in pendulum, by 19.8%, 4.0%, 5.8%, and 5.8% 
for different kernel outputs of lulesh, and by 33.3% for one output of molecular. 
From these experimental results, we conclude that the ranges that we inferred in 
the first phase are actually useful for kernel analysis. 
7 Related Work 


Abstract interpretation-based techniques are in principle uniquely suitable for 
verifying the absence of special values and safety in floating-point programs. We 
have chosen Astrée [63] in this work because it is an industrial-strength tool, 
and as such, supports a wide range of C programs and is designed for scalability. 
Apron [50] is a library of numerical abstract domains that are sound w.r.t. floating- 
point arithmetic, and includes, for instance, the domain of polyhedra [19], which 
is, however, significantly more expensive than the interval arithmetic domain 
that we use. ELINA [71] provides performance-optimized implementations of 
many numerical abstract domains, but its polyhedra domain does not support 
floating-point arithmetic. 

These domains only bound variable values; abstract domains [43,33,31,30] or 
optimization-based static analyses [60,65,72] for bounding roundoff errors provide 
nontrivial results only for relatively small kernels. For the second step in our 
framework, we could have in principle chosen any of these tools; we chose Daisy 
because we found it easy to modify for our needs, and because it already includes 
the rewriting optimization. 

In the space of deductive verification, besides Frama-C [24], the Boogie interme- 
diate verification language [53] also has support for floating-point arithmetic and 
discharges the verification conditions using the Z3 SMT solver. Similarly, bounded 
model checking [52] is limited by the performance of the underlying SAT/SMT 
solvers. While the floating-point support in today’s SMT solvers [17,16] has im- 
proved significantly in recent years, it is still limited to relatively few arithmetic 
expressions. 

Many interactive theorem provers have floating-point formalizations [49,15,37]. 
While these make it possible to prove complex functional properties [13,14,46], the proofs 
are largely manual and require significant expertise. 

Blackbox testing has been explored to find large roundoff errors by executing a 
higher-precision version of the program side-by-side [10,21,77]. Recently, whitebox 
testing has been used for detecting overflows [38], by phrasing the search as a 
mathematical optimization problem, and large roundoff errors [21,78], by adapting 
the notion of condition numbers. KLEE-Float [57], FPGen [44] and Ariadne [9] 
use symbolic execution for finding bugs in floating-point code, including overflows 
and large precision loss and cancellation. While KLEE-Float relies on the floating- 
point SMT decision procedures, Ariadne approximates the path constraints and 
uses the real-valued theory. FPGen injects specialized inaccuracy checks to find 
cancellations. Only FPDebug [10] has been shown to scale beyond numerical 
kernels and, to the best of our knowledge, none of the dynamic techniques have 
been used to obtain range information. 

Once a large roundoff error has been identified, Herbgrind [69] can help to 
locate its root cause, which may be in a different instruction than where the error 
becomes significant. Herbgrind is thus complementary to our work and may be 
used to locate root causes of potential cancellation errors reported by Daisy. 

Rewriting floating-point expressions in order to optimize roundoff errors has 
been explored in the tool Herbie [67] and others [74,76]. These approaches attempt 
to repair unstable code, checking accuracy using a dynamic analysis. They are 
alternatives to using Daisy for the second step in our framework. Alternative 
program optimizations that we have not explored in this work, but that also 
require range information, include mixed-precision tuning [32,20,68] and general 
non-semantics preserving approximation [70]. 

Apart from AFLGo [12], there is a wide range of targeted greybox fuzzers, such 
as those targeting specified program locations [18], rare branches [54], unexplored 
branches [55,73], or potential vulnerabilities [39,45,22,56]. In our setting, we 
require fuzzers like AFLGo to target the specific program locations of kernels. 

There is a significant body of work on guiding program analyzers. In particular, 
test case generation is typically guided by a static analysis toward specific parts 
of the code (e.g., [27,35,66,41,40,58,62,28,59,23,36,34,75,44]). Our approach is 
similar to these techniques as it infers input ranges to guide verifiers of numerical 
kernels toward those kernel executions that are relevant in the context of the 
containing application. 


8 Conclusion 


Even though floating-point programs have received a lot of attention recently, the 
focus has been largely on verifying or debugging arithmetic kernels. Our review 
of existing techniques and tools has shown that few approaches with specific 
floating-point support are applicable to whole programs without significant user 
expertise. We have found, however, that standard greybox fuzzing proved to be 
effective in detecting overflows and NaNs. Meanwhile, static-analysis techniques 
to show the absence of special values and cancellation errors remain limited to 
programs with few bounded loops and numerical kernels, respectively. 

Instead of trying to scale up existing roundoff-error analysis tools to whole 
programs, we combine them with more scalable analyses that compute the kernel 
preconditions needed for the roundoff analyses to work. We showed how relatively 
small adaptations to well-known techniques of directed blackbox and greybox 
fuzzing are enough to realize such a framework. Together with modifications to an 
existing roundoff-error analyzer, we are able to conditionally verify the absence 
of special values and cancellations in a number of numerical kernels in realistic 
floating-point programs that are out of reach for today’s analyses. At the same 
time, our analysis is precise enough to identify several cases of cancellations. While 
our approach is not suitable and not intended for certification of safety-critical 
systems, we believe that it nonetheless provides valuable debugging feedback for 
many real-world applications. 
Abstract. Problems arising in many scientific disciplines are often modelled 
using edge-coloured directed graphs. These can be enormous in the 
number of both vertices and colours. Given such a graph, the original 
problem frequently translates to the detection of the graph’s strongly 
connected components, which is challenging at this scale. 

We propose a new, symbolic algorithm that computes all the monochro- 
matic strongly connected components of an edge-coloured graph. In the 
worst case, the algorithm performs $O(p \cdot n \cdot \log n)$ symbolic steps, where 
p is the number of colours and n the number of vertices. We evaluate the 
algorithm using an experimental implementation based on Binary Decision 
Diagrams (BDDs) and large (up to $2^{43}$) coloured graphs produced 
by models appearing in systems biology. 


Keywords: strongly connected components · symbolic algorithm · edge-coloured 
digraphs · systems biology 


1 Introduction 


Processing massive data sets poses a series of interesting computational challenges. 
A variety of these data sets can be modelled as very large multigraphs, augmented 
by a specific collection of application-dependent edge attributes. These attributes 
are often represented as colours and the resulting formalism is called an edge- 
coloured graph [4, 10]. Geographic information systems, telecommunications traffic, 
or internet data are prime examples of data that are best represented as such edge- 
coloured graphs. For instance, in social networking, such graphs are typically used to identify 
groups of nodes related to each other by some specific criteria (Sports, Health, 
Technology, Religion, etc.) represented as colours. Our interest in processing huge 
edge-coloured graphs is primarily motivated by applications taken from systems 
biology [5,29] and genetics [25] where we have to deal not only with giant graphs 
as measured by the number of vertices and edges but also with large sets of 
colours. The colours in such graphs represent various parameters that influence 
the dynamics of a biological system [5, 9, 46]. 

Fundamental graph algorithms such as breadth-first search, spanning tree 
construction, shortest paths, decomposition into strongly connected components 
(SCCs), etc., are building blocks of many practical applications. For edge-coloured 
graphs, the primary research focus so far has been on some of the 
“classical” coloured graph problems, like the determination of the chromatic index, 
finding sub-graphs with a specified colour property (the coloured version of the 
k-linked problem), properly edge-coloured cycles and paths, alternating cycles, 
rainbow cliques, monochromatic cliques, monochromatic cycles, etc. [1-4, 55, 33]. 

To the best of our knowledge, there is no prior work on SCC decomposition 
for edge-coloured graphs, even though this problem has many important 
applications. For example, in biological systems, connected components represent 
the attractors of the system. These play an essential role in determining the 
system’s properties, since they may correspond, for example, to the specific phe- 
notypes of a cell [21]. The parameters (e.g. reaction rates) in such systems might 
be represented as edge colours in the state transition graph. The knowledge of 
attractors and how their structure depends on parameters is vital for understand- 
ing various biological phenomena [24, 38]. Other applications where investigation 
of attractors is crucial include predictions of the global climate change [52], or 
predictions of spreading of infectious diseases such as COVID-19 [39]. 

There is a serious computational problem related to the processing of massive 
edge-coloured graphs, even the non-coloured ones, that significantly affects the 
tractability of SCC decomposition. The graphs often cannot be handled with 
standard (explicit) representations since they are too large to be kept in the main 
memory. Various approaches have been considered to deal with such giant graphs: 
distributed-memory structures, structures for representing graphs symbolically, 
or storing the graphs in external memory. We review these approaches in more 
detail in the related work section. 

In [6,13] we have initially attacked the SCC decomposition problem for 
massive edge-coloured graphs by developing a parallel semi-symbolic algorithm 
for detecting terminal SCCs. The algorithm uses symbolic structures to represent 
sets of parameters, while the graph itself is represented explicitly. The results 
have shown that the parallel semi-symbolic algorithm is not sufficient for the 
practical needs to tackle large graphs representing real-world problems. Those 
findings have motivated us to propose an entirely symbolic approach. 

In this paper, we consider edge-coloured multi-digraphs, i.e., multi-digraphs 
such that each directed edge has a colour and no two parallel (i.e., joining the 
same pair of vertices) edges have the same colour. Here, we refer to such graphs 
simply as coloured graphs. For coloured graphs, we can define several notions 
of strongly connected components involving colours. We consider the simplest 
case, where the SCCs are monochromatic, that is all their edges have the same 
colour [35]. This choice is motivated by the application in systems biology, as 
mentioned above. 

We propose a novel fully symbolic algorithm for detecting all monochro- 
matic components in coloured graphs which is in practice significantly faster 
than is achievable with a naïve execution of an algorithm for symbolic SCC 
decomposition scanning all colours one-by-one, in particular on massive coloured 
graphs. This is because in many applications, the edges are largely shared among 
individual colours [5] and our algorithm is capable of exploiting this fact. The 
algorithm conceptually follows the lock-step reachability approach by Bloem [14] 
for monochromatic digraphs. The key new ingredients behind our algorithm are 
a careful orchestration of the forward and backward reachability for different 
colours and a sophisticated selection of a set of pivots. 


1.1 Related Work 


The detection of SCCs in (monochromatic) digraphs is a well-known problem 
computable in linear time. The best serial (explicit) algorithms are Kosaraju-Sharir [50] 
and Tarjan [53], which are both inherently based on depth-first search. However, 
these algorithms do not scale for large graphs, e.g., those encountered in model- 
checking. Therefore, alternative approaches to SCC decomposition have been 
proposed (I/O efficient, parallel, symbolic algorithms). 

The algorithm of Jiang [32] gives an I/O-efficient alternative based on a com- 
bination of depth-first and breadth-first search. 

Efficient parallel distributed-memory algorithms avoid the inherently sequen- 
tial DFS step [45] in several different ways. The Forward-Backward algorithm [26] 
employs a divide-and-conquer approach relying on picking a pivot state and split- 
ting the graph in three independent (no crossing SCCs) parts. The approach of 
Orzan [44] uses a different distribution scheme called a colouring transformation 
employing a set of prioritised colours to split the graph into many parts at once. 
The recursive OWCTY-Backward-Forward (OBF) approach is proposed in [8]. 
It recursively splits the graph in a number of independent sub-graphs called 
OBF slices and applies to each slice the One-Way-Catch-Them-Young (OWCTY) 
technique. In [51] the authors utilise variants of Forward-Backward and Orzan’s 
algorithms for optimal execution on shared-memory multi-core platforms. Fi- 
nally, Bloemen et al. [15] utilise the important ability of Tarjan’s algorithm to 
return detected SCCs on-the-fly. In particular, they present an on-the-fly parallel 
algorithm showing promising speedups for large graphs containing large SCCs. 
On another end, GPU-accelerated approaches to computing SCCs have been 
addressed, e.g., in [7, 30, 37, 56]. 

Computing SCCs of (monochromatic) digraphs symbolically is another way 
to handle giant graphs and has been thoroughly explored in literature. As 
in the case of efficient parallelisation, depth-first search is not feasible in the 
symbolic framework [28]. In consequence, many DFS-based algorithms cannot be 
easily revised to work with symbolically represented graphs. An algorithm based 
on forward and backward reachability performing $O(n^2)$ symbolic steps was 
presented by Xie and Beerel in [57]. Bloem et al. present an improved $O(n \cdot \log n)$ 
algorithm in [14]. Finally, an $O(n)$ algorithm was presented by Gentilini et 
al. in [27, 28]. This bound has been proved to be tight in [20]. In [20], the authors 
argue that the algorithm from [27] is optimal even when considering more fine- 
grained complexity criteria, like the diameter of the graph and the diameter of the 
individual components. Ciardo et al. [59] use the idea of saturation [22] to speed 
up state exploration when computing each SCC in the Xie-Beerel algorithm, and 
compute the transitive closure of the transition relation using a novel algorithm 
based on saturation. Besides these generic algorithms, symbolic SCC 
decomposition methods have also recently been proposed for specific large 
graphs, e.g., graphs generated by Boolean networks [42, 58]. 


2 Problem Definition 


As we have already stated in the introductory section, the SCC decomposition 
problem for edge-coloured graphs has remained mostly unexplored until now. We 
thus start this paper by introducing and formalising the notion of coloured SCC 
decomposition itself and state some of its basic properties. 

Before giving exact definitions, it might be instructive to discuss the substance 
of the coloured SCC decomposition intuitively. There are several ways of capturing 
the notion of a “coloured connected component”. For example, one of them is that 
of a colour-connectivity first introduced by Saad [47]. It is based on alternating 
paths in which successive edges differ in colour. However, there is no unique, 
universally acceptable notion of a coloured component. 

In the biological application we have in mind, we want to identify a coloured 
component as a coloured collection of SCCs—a collection where for every colour 
there is a set of all relevant monochromatic SCCs. Such a setting leads us to 
represent SCCs in the form of a relation. To that end, we first introduce such a 
relation for monochromatic graphs (Section 2.1) and subsequently extend it to 
edge-coloured graphs (Section 2.2). The relation-based approach gives us also 
the advantage of allowing a feasible symbolic encoding of the problem. 


2.1 Graphs and Strongly Connected Components 


Let us first recall the standard definitions of a directed graph and its strongly 
connected components: 


Definition 1. A directed graph is a tuple $G = (V, E)$ where $V$ is a set of graph 
vertices and $E \subseteq V \times V$ is a set of graph edges. 


We are going to use the word graph to mean directed graph in the following. 
We write $u \to v$ when $(u,v) \in E$ and $u \to^* v$ when $(u,v) \in E^*$, the reflexive 
and transitive closure of $E$. We say that $v$ is reachable from $u$ if $u \to^* v$. The 
reachability relation allows us to decompose a graph into strongly connected 
components, defined as follows: 


Definition 2. In a graph $G = (V, E)$, a strongly connected component (SCC) 
is a maximal set $W \subseteq V$ such that for all $u, v \in W$, $u \to^* v$ and $v \to^* u$. For a 
fixed $v \in V$, we write SCC(G, v) to denote the SCC of G that contains v. 


If the graph G is clear from the context, we can simply write SCC(v). A 
set of vertices $S \subseteq V$ is said to be SCC-closed if every SCC $W$ is either fully 
contained inside $S$ ($W \subseteq S$), or in its complement ($W \subseteq V \setminus S$). Notice that 
given a vertex v, the set of all vertices reachable from v, as well as the set of all 
vertices that can reach v, are both SCC-closed. 
A pivotal problem in computer science is to find the SCC decomposition of G. 
As mentioned above, we represent the decomposition in the form of an equivalence 
relation $R_{scc}$ such that the individual SCCs are exactly the equivalence classes 
of $R_{scc}$. The relation-based formulation of the SCC decomposition problem is 
the following: 


Problem 1 (SCC decomposition) Given a graph $G = (V, E)$, find the SCC 
decomposition relation $R_{scc} \subseteq V \times V$ such that $(u,v) \in R_{scc}$ if and only if 
SCC(u) = SCC(v). 


Note that SCC(u) is the section of the first attribute of $R_{scc}$ at u, i.e. SCC(u) = 
$\{v \mid (u,v) \in R_{scc}\}$. We denote such a section in the following way: SCC(u) = 
$R_{scc}(u, \_)$. Here, u is the specific value of an attribute at which the section is 
taken, and $\_$ is used in place of the attributes that remain unchanged. Such 
notation naturally extends to relations of arbitrary arity. 


2.2 Coloured SCC Decomposition Problem 


We now lift the formal framework to the coloured setting. An edge-coloured 
graph can be seen as a succinct representation of several different graphs, all 
sharing the same set of vertices. Note that to emphasise the difference from the 
standard graphs as given in Definition 1, we sometimes call the standard graphs 
monochromatic. 


Definition 3. An edge-coloured directed multi-graph (coloured graph for short) 
is a tuple $\mathcal{G} = (V, C, E)$ where $V$ is a set of vertices, $C$ is a set of colours and 
$E \subseteq V \times C \times V$ is a coloured edge relation. 


We also write $u \xrightarrow{c} v$ whenever $(u,c,v) \in E$. By fixing a colour $c \in C$ and 
keeping only the c-coloured edges (with the colour attribute removed), we obtain 
a monochromatic graph $\mathcal{G}(c) = (V, \{(u,v) \mid (u,c,v) \in E\})$. We call this graph 
the monochromatisation of $\mathcal{G}$ with respect to c. Intuitively, one can view the 
elements of C as a type of graph parametrisation where the edge structure of the 
graph changes based on the specific $c \in C$. 
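
As a small illustration of the definition, the monochromatisation can be computed on an explicitly represented coloured graph. The sketch below uses plain Python sets rather than a symbolic representation, and the function name is hypothetical.

```python
def monochromatise(vertices, coloured_edges, colour):
    """Return G(c) = (V, {(u, v) | (u, c, v) in E}) for the given colour c."""
    edges = {(u, v) for (u, c, v) in coloured_edges if c == colour}
    return vertices, edges

# A coloured graph with three vertices and two colours.
example_vertices = {1, 2, 3}
example_edges = {(1, "red", 2), (2, "red", 1), (2, "red", 3), (1, "blue", 2)}

_, red_edges = monochromatise(example_vertices, example_edges, "red")
_, blue_edges = monochromatise(example_vertices, example_edges, "blue")
print(sorted(red_edges))   # [(1, 2), (2, 1), (2, 3)]
print(sorted(blue_edges))  # [(1, 2)]
```

Both monochromatisations share the vertex set, and the edge (1, 2) is present under both colours, which is the kind of sharing between colours that a coloured algorithm can exploit.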

The SCC decomposition relation $R_{scc}$ is extended to the coloured SCC 
decomposition relation $\mathcal{R}_{scc}$ by relating every colour $c \in C$ with all SCCs of the 
monochromatisation $\mathcal{G}(c)$. In consequence, the SCC decomposition problem is 
then lifted to the coloured SCC decomposition problem as follows: 


Problem 2 (Coloured SCC decomposition) Given a coloured graph $\mathcal{G} = 
(V, C, E)$, find the coloured SCC decomposition relation $\mathcal{R}_{scc} \subseteq V \times C \times V$ 
satisfying $(u,c,v) \in \mathcal{R}_{scc}$ if and only if $(u,v) \in R_{scc}$ of $\mathcal{G}(c)$. 

From this definition, we can immediately observe the following properties 
about the relationship of $\mathcal{R}_{scc}$ with the terms which we have defined before: 


– $R_{scc}$ of a monochromatisation $\mathcal{G}(c)$ is exactly the section $\mathcal{R}_{scc}(\_, c, \_)$; 
– SCC($\mathcal{G}(c)$, v) is exactly the section $\mathcal{R}_{scc}(v, c, \_)$. 


From this, it should be immediately clear that $\mathcal{R}_{scc}$ contains all components of 
the underlying monochromatisations. 
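
To make the relation concrete, the following sketch builds the coloured SCC decomposition relation for a small explicit graph by processing the colours one by one. This is the naive baseline, not the symbolic algorithm proposed in this paper, and all function names are hypothetical.

```python
def reachable(start, edges):
    """Vertices reachable from `start` via the reflexive-transitive closure."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in successors.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def coloured_scc_relation(vertices, colours, coloured_edges):
    """All triples (u, c, v) such that u and v share an SCC of G(c)."""
    relation = set()
    for c in colours:
        edges = {(u, v) for (u, k, v) in coloured_edges if k == c}
        reach = {u: reachable(u, edges) for u in vertices}
        for u in vertices:
            for v in reach[u]:
                if u in reach[v]:  # mutual reachability within colour c
                    relation.add((u, c, v))
    return relation

def section(relation, u, c):
    """The section R(u, c, _): the SCC of G(c) that contains u."""
    return {v for (x, k, v) in relation if x == u and k == c}

scc_vertices = {1, 2, 3}
scc_edges = {(1, "red", 2), (2, "red", 1), (2, "red", 3), (1, "blue", 2)}
scc_relation = coloured_scc_relation(scc_vertices, {"red", "blue"}, scc_edges)
print(section(scc_relation, 1, "red"))   # {1, 2}
print(section(scc_relation, 1, "blue"))  # {1}
```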
3 Algorithm 


Conceptually, our algorithm follows the lock-step reachability approach by 
Bloem [14] for monochromatic graphs. The lock-step algorithm itself is based on 
the basic forward-backward decomposition algorithm [57]. In this section, we first 
briefly introduce these two algorithms in order to explain better the key ideas 
behind our approach and, in particular, to explain what were the main difficulties 
encountered in employing the concepts of these algorithms to edge-coloured 
graphs. Although the algorithms were originally presented as producing a set of 
SCCs, we reformulate them slightly using the equivalent relation-based approach 
as explained in the previous section. After that, we present the coloured SCC 
decomposition algorithm. However, before we dive into the algorithmics, let us 
first briefly discuss the computation model we are using. 


3.1 Symbolic Computation Model 


As a complexity measure of our algorithm, we consider the number of symbolic 
steps, or more specifically, symbolic set and relation operations that the algorithm 
performs. As is customary, we assume that sets of vertices (V) and colours (C) 
can be represented symbolically (for example, using reduced ordered binary 
decision diagrams [17]) as well as any relations over these sets. In particular, we 
often talk about coloured vertex sets, by which we mean the subsets of V x C. 

Aside from normal set operations (union, intersection, difference, product and 
element selection), we also require some basic relational operations, all of which 
we outline in Fig. 1. These extra operations tend to appear in other applications 
as well (such as symbolic model checking [18]), and are thus typically already 
available in mature symbolic computation packages. 

Finally, there are several derived operators that are partially specific to our 
application to coloured graphs. However, these can be constructed using standard 
set and relation operations. The intuitive meaning of the derived operators is 
as follows: COLOURS returns all the colours that appear in the given coloured 
vertex set. PRE and POST compute the pre- and post-image of a (monochromatic
or coloured) set of vertices, i.e. the set of predecessors or successors of all the
vertices in the given set, respectively. Finally, JOIN takes a coloured vertex set A
and computes the set {(u, c, v) | (u, c) ∈ A, (v, c) ∈ A}.


3.2 Forward-backward Algorithm 


To symbolically compute the SCCs of a graph G = (V, E), Xie and Beerel [57]
observed that for any vertex v ∈ V, the intersection W = F ∩ B of the forward
reachable vertices F = {v' ∈ V | v →* v'} and the backward reachable vertices
B = {v' ∈ V | v' →* v} is exactly the strongly connected component of G which
contains v.

The algorithm thus picks an arbitrary pivot v ∈ V, and divides the vertices of
the graph into four disjoint sets: W, F \ W, B \ W and V \ (F ∪ B). This is illustrated
graphically in Fig. 2 (left). The set W is then immediately reported as an SCC
Standard set operations

  pick element      PICK(A)     arbitrary x ∈ A
  union             A ∪ B       {x | x ∈ A ∨ x ∈ B}
  intersection      A ∩ B       {x | x ∈ A ∧ x ∈ B}
  difference        A \ B       {x | x ∈ A ∧ x ∉ B}
  product           A × B       {(x, y) | x ∈ A ∧ y ∈ B}

Relation manipulation (R ⊆ S_1 × ... × S_n)

  i-th section at x                  σ_i(x, R)           {(y_1, ..., y_{i-1}, y_{i+1}, ..., y_n) |
                                                          (y_1, ..., y_{i-1}, x, y_{i+1}, ..., y_n) ∈ R}
  existential quantification of      ∃_i(R)              ⋃_{x ∈ S_i} σ_i(x, R)
  the i-th element
  swap                               SWAP(R ⊆ A × B)     {(y, x) ∈ B × A | (x, y) ∈ R}

Derived operations (G = (V, E), 𝒢 = (V, C, E))

  colours                COLOURS(A ⊆ V × C)      ∃_1(A)
  pre-image              PRE(G, A ⊆ V)           ∃_2((V × A) ∩ E)
  post-image             POST(G, A ⊆ V)          ∃_1((A × V) ∩ E)
  coloured pre-image     PRE(𝒢, A ⊆ V × C)       ∃_3((V × SWAP(A)) ∩ E)
  coloured post-image    POST(𝒢, A ⊆ V × C)      SWAP(∃_1((A × V) ∩ E))
  coloured join          JOIN(A ⊆ V × C)         (V × SWAP(A)) ∩ (A × V)

Fig. 1. Summary of symbolic operations that appear in the presented algorithms. The 
derived operations can be implemented using the standard and relational operations. 
However, they typically also have slightly more efficient direct implementations.


of the graph, and added into the component relation: R_scc ← R_scc ∪ (W × W).
It is easy to see that every other SCC is fully contained within one of the three 
remaining sets (they are SCC-closed), and the algorithm thus recursively repeats 
this process independently in each set. 

The correctness of the algorithm follows from the initial observation and the 
fact that every vertex eventually appears in W (either as a pivot or as a result of 
F ∩ B). In the worst case, the algorithm performs O(|V|²) symbolic steps, since
every vertex is picked as a pivot at most once and the computation of F and B
requires at most O(|V|) PRE/POST operations.
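The recursion described above can be sketched on explicit Python sets as follows. This is an illustration under stated assumptions rather than a symbolic implementation: edges are pairs (u, v), the helper names reach and forward_backward are hypothetical, and the result is collected as a list of components instead of the relation R_scc.

```python
def reach(start, step, universe):
    """Everything reachable from start by repeatedly applying step inside universe."""
    reached, frontier = set(start), set(start)
    while frontier:
        frontier = (step(frontier) & universe) - reached
        reached |= frontier
    return reached

def forward_backward(edges, universe, sccs):
    """Append every SCC of the subgraph induced by universe to the list sccs."""
    if not universe:
        return
    pivot = next(iter(universe))  # arbitrary pivot
    post = lambda s: {v for (u, v) in edges if u in s}
    pre = lambda s: {u for (u, v) in edges if v in s}
    f = reach({pivot}, post, universe)  # forward reachable set F
    b = reach({pivot}, pre, universe)   # backward reachable set B
    w = f & b                           # the SCC of the pivot
    sccs.append(w)
    # Every other SCC lies entirely in one of the three remaining sets.
    forward_backward(edges, f - w, sccs)
    forward_backward(edges, b - w, sccs)
    forward_backward(edges, universe - (f | b), sccs)

# Hypothetical example: two cycles {0, 1} and {2, 3} connected by the edge 1 -> 2.
example_edges = {(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)}
found = []
forward_backward(example_edges, {0, 1, 2, 3}, found)
```

Restricting each search to the current universe is safe here because every set passed to a recursive call is SCC-closed.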


3.3 Lock-step Algorithm 


To improve the efficiency of the forward-backward algorithm, the lock-step 
approach [14] uses another important observation: To compute W, it is not 
necessary to fully compute both F and B; only the smaller (in terms of diameter) 
of the two sets needs to be entirely known. With this observation, the computation 
of F and B can be modified in the following way: Instead of computing F and 
B one after the other, the computation is interleaved in a step-by-step manner 
(dovetailing). When one of the sets is fully computed, the computation of the 
second set is stopped. Let us call the computed set converged and denote it by 


Fig. 2. Illustration of the difference between the forward-backward algorithm (left) and 
the lock-step algorithm (right). On the left, we fully compute both backward (B) and 
forward (F) reachable sets from the pivot v, identifying W as F ∩ B. On the right,
without loss of generality, assume F is fully computed first. It thus becomes converged 
(Con) and the computation of B (Non) is stopped before it is fully explored. 


Con, and the unfinished set non-converged and denote it by Non. This situation 
is illustrated in Fig. 2 (right). 

However, even when Con is fully known, we still need to finish the computation 
of states in Non that are inside Con to discover the whole component W. This 
is necessary if there are vertices w in W whose forward distance from v (i.e. the
length of the path v →* w) is short while their backward distance (the length of
the path w →* v) is long, or vice versa. Such vertices are thus only discovered
in one of the two reachability procedures and still need to be discovered by the 
other one to identify the whole component. However, an important observation 
is that only the vertices already inside Con need to be considered in this step. 

After this, the SCC can be identified and reported just as in the forward- 
backward algorithm. Finally, the recursion now continues in sets Con \ W and 
V \ Con. This is due to Non being not fully computed; we cannot guarantee that 
no SCC overlaps outside of Non (Non is not necessarily SCC-closed). 

The algorithm is still correct because every vertex is eventually either picked
as a pivot or discovered in some W. Furthermore, the way Con and Non
are computed guarantees that W is still a whole SCC. In terms of complexity,
the algorithm performs O(|V|·log |V|) symbolic steps in the worst case. To see
why this is true, we may observe that every vertex appears in W exactly once,
and that the smaller of the two sets Con \ W and V \ Con, let us call it S, is
always smaller than |V|/2. The authors then argue that the price of every iteration
can be attributed (up to a multiplicative constant) to the vertices in S ∪ W and
that every vertex appears in S at most O(log |V|) times.
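The interleaving and the subsequent completion step can be illustrated with an explicit-set sketch. It is not the symbolic algorithm of [14] itself: the function name lock_step, the edge encoding as pairs (u, v) and the list-based output are assumptions made for the example.

```python
def lock_step(edges, universe, sccs):
    """Append every SCC of the subgraph induced by universe to the list sccs."""
    if not universe:
        return
    pivot = next(iter(universe))
    post = lambda s: {v for (u, v) in edges if u in s} & universe
    pre = lambda s: {u for (u, v) in edges if v in s} & universe
    f, b = {pivot}, {pivot}
    f_front, b_front = {pivot}, {pivot}
    # Interleave one forward and one backward step until one search converges.
    while f_front and b_front:
        f_front = post(f_front) - f
        b_front = pre(b_front) - b
        f |= f_front
        b |= b_front
    if not f_front:
        con, non, non_front, step = f, b, b_front, pre
    else:
        con, non, non_front, step = b, f, f_front, post
    # Finish the non-converged search, but only inside the converged set Con.
    non_front = non_front & con
    while non_front:
        non_front = (step(non_front) & con) - non
        non |= non_front
    w = con & non
    sccs.append(w)
    # Non is not necessarily SCC-closed, so the recursion uses Con only.
    lock_step(edges, con - w, sccs)
    lock_step(edges, universe - con, sccs)

# Hypothetical example: two cycles {0, 1} and {2, 3} connected by the edge 1 -> 2.
example_edges = {(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)}
found = []
lock_step(example_edges, {0, 1, 2, 3}, found)
```

Compared with the forward-backward scheme, the first loop stops as soon as either frontier is empty, so its length is governed by the search that converges first.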


3.4 Coloured Lock-step Algorithm 


When developing an algorithm for coloured graphs, we had to deal with multiple 
challenges which do not appear for monochromatic graphs and require careful 
consideration. In the following, we refer to the pseudocode in Algorithm 1. 

An important observation is that the structure of components in the graph can 
change arbitrarily with respect to the graph colours. In consequence, our algorithm 
Algorithm 1: Symbolic Coloured SCC Decomposition

 1 Function COLOUREDSCC(𝒢 = (V, C, E))
 2   R_scc ⊆ (V × C × V) ← ∅;
 3   DECOMPOSITION(𝒢, R_scc, V × C);
 4   return R_scc;
 5 Function DECOMPOSITION(𝒢 = (V, C, E), R_scc ⊆ (V × C × V), 𝒱 ⊆ (V × C))
 6   if 𝒱 = ∅ then return;
 7   F, B, F̂, B̂ ⊆ (V × C) ← PIVOTS(𝒱);
 8   F_p, B_p ⊆ (V × C) ← ∅;
 9   F_lock, B_lock ⊆ C ← ∅;
10   while F_lock ∪ B_lock ⊂ COLOURS(𝒱) do
11     F′ ⊆ V × C ← (POST(𝒢, F̂) ∩ 𝒱) \ F;
12     B′ ⊆ V × C ← (PRE(𝒢, B̂) ∩ 𝒱) \ B;
13     F_lock ← F_lock ∪ (COLOURS(𝒱) \ COLOURS(F′));
14     B_lock ← B_lock ∪ (COLOURS(𝒱) \ COLOURS(B′) \ F_lock);
15     F_p ← F_p ∪ (F′ ∩ (V × B_lock));
16     B_p ← B_p ∪ (B′ ∩ (V × F_lock));
17     F̂ ← F′ \ (V × B_lock);
18     B̂ ← B′ \ (V × F_lock);
19     F ← F ∪ F′;
20     B ← B ∪ B′;
21   end
22   Con ⊆ V × C ← (F ∩ (V × F_lock)) ∪ (B ∩ (V × B_lock));
23   F̂ ← F_p ∩ Con;
24   B̂ ← B_p ∩ Con;
25   while F̂ ≠ ∅ ∨ B̂ ≠ ∅ do
26     F̂ ← (POST(𝒢, F̂) ∩ Con) \ F;
27     B̂ ← (PRE(𝒢, B̂) ∩ Con) \ B;
28     F ← F ∪ F̂;
29     B ← B ∪ B̂;
30   end
31   W ⊆ V × C ← F ∩ B;
32   R_scc ← R_scc ∪ JOIN(W);
33   DECOMPOSITION(𝒢, R_scc, 𝒱 \ Con);
34   DECOMPOSITION(𝒢, R_scc, Con \ W);
35 Function PIVOTS(𝒱)
36   P ⊆ (V × C) ← ∅; 𝒱′ ⊆ (V × C) ← 𝒱;
37   while 𝒱′ ≠ ∅ do
38     (v, c) ← PICK(𝒱′);
39     P ← P ∪ ({v} × σ_1(v, 𝒱′));
40     𝒱′ ← 𝒱′ \ (V × COLOURS(P));
41   end
42   return P;
cannot simply operate with sets of graph vertices as the normal algorithm would.
To that end, we use the notion of coloured vertex sets as introduced in Section 3.1 
where the symbolic operations we perform on these sets have been described. 

Initially, the algorithm starts with all vertices and colours, i.e. the full set
V × C. However, as the components are discovered, the intermediate results may
contain different vertices appearing only for certain subsets of C. As a result,
we often cannot pick a single pivot vertex that would be valid for all considered
colours. Instead, we aim to pick a pivot set P ⊆ V × C such that for every colour
that still appears in 𝒱, the set contains exactly one vertex. Alternatively, one can
also view the pivot set as a (partial) function from C to V. This is done in the
PIVOTS function.
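The pivot selection can be illustrated on explicit sets as follows. The sketch assumes coloured vertex sets encoded as Python sets of pairs (v, c), and the function name pivots is chosen for the example only; it follows the same idea of reusing a picked vertex for every remaining colour in which it is still present.

```python
def pivots(coloured_set):
    """Pick exactly one pivot vertex for every colour that occurs in coloured_set."""
    chosen = set()
    remaining = set(coloured_set)
    while remaining:
        v, _ = next(iter(remaining))  # arbitrary element of the remaining set
        # Reuse v for every remaining colour in which it is still present.
        chosen |= {(u, c) for (u, c) in remaining if u == v}
        covered = {c for (_, c) in chosen}
        # Drop all pairs whose colour already has a pivot.
        remaining = {(u, c) for (u, c) in remaining if c not in covered}
    return chosen

# Hypothetical coloured vertex set: vertex 2 only exists for colour "b".
example_set = {(0, "r"), (1, "r"), (1, "g"), (2, "b")}
pivot_set = pivots(example_set)
```

The result can be read as a function from colours to vertices; which vertex is chosen for a colour depends on the arbitrary pick.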

The lock-step reachability procedure also cannot operate as in a standard 
graph. First of all, there can be colours where the forward reachability converges 
first, as well as colours where this happens for backward reachability. The 
algorithm thus has to account for both options simultaneously. Second, for each 
colour, the reachability can converge in a different number of steps. To deal 
with this problem, we introduce the F_lock and B_lock variables. These store the
mutually disjoint sets of colours for which forward and backward reachability
already converged. The lock-step procedure terminates when F_lock and B_lock
contain all the colours that appear in 𝒱.

Throughout the algorithm, we keep track of several coloured-set variables.
The first two, F and B, represent the forward and backward reachable sets,
respectively. We then have four variables F̂, F_p, B̂, B_p to represent the frontiers
of these sets, i.e., the sets of pairs (v, c) such that the vertex v has not yet been
expanded in the corresponding reachability procedure for the colour c. The
frontier of F is the set F̂ ∪ F_p. The sets F̂ and F_p contain disjoint colours:
F̂ involves those colours for which the lock-step reachability procedure has not
finished yet, while F_p represents the unfinished part of the frontier that shall be
explored in the second while cycle; similarly for B̂ and B_p.

In the first while cycle (lines 10-21), we compute the reachability sets in
the lock-step manner. Once a reachability set is completed for some colours
(i.e., there are no vertices to expand with those colours), we add the colours to
the corresponding F_lock or B_lock variable. Note that we ensure that F_lock and
B_lock remain disjoint even if the two reachability procedures converged at the
same time for certain colours (see line 14). We use F_lock and B_lock to split the
newly computed frontier sets into the parts that are to be explored in the next
iteration (F̂, B̂) and the parts that are currently left unfinished (F_p, B_p).

After the first while cycle, we compute the set Con that is an analogue for the 
converged set of the original lock-step algorithm (line 22). As already suggested 
above and unlike the original algorithm, this set cannot be just F or B, but is 
instead a mixture of both, depending on the convergent colours. To compute this 
set, we use the F_lock and B_lock variables.

The second while cycle (lines 25-30) then completes the unfinished forward 
and backward reachability set, restricted to the inside of the converged set. The 
intersection of F and B then forms a coloured set W with the property that 
for all c ∈ COLOURS(𝒱), W(_, c) is a strongly connected component of 𝒢(c). We
create the corresponding relation using the JOIN operation, add this relation to
the resulting R_scc, and recursively call the whole procedure with 𝒱 \ Con and
Con \ W as the base coloured sets of vertices.
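To make the interplay of the two while cycles more concrete, the following explicit-set sketch mirrors the overall structure of the procedure. It is an illustration under assumptions, not the symbolic implementation: edges are triples (u, c, v), coloured sets are sets of pairs (v, c), and all function and variable names are hypothetical. The sketch also excludes colours that are already locked on the other side when updating either lock set, which is an assumption made here to keep the two lock sets disjoint in every iteration.

```python
def colours(a):
    """COLOURS: the colours that occur in a coloured vertex set."""
    return {c for (_, c) in a}

def restrict(a, cs):
    """Keep only the pairs of a whose colour is in cs."""
    return {(v, c) for (v, c) in a if c in cs}

def pivots(a):
    """One pivot vertex for every colour that occurs in a."""
    chosen, remaining = set(), set(a)
    while remaining:
        v, _ = next(iter(remaining))
        chosen |= {(u, c) for (u, c) in remaining if u == v}
        remaining = {(u, c) for (u, c) in remaining if c not in colours(chosen)}
    return chosen

def decomposition(edges, universe, r_scc):
    """Add the SCC relation of every colour, restricted to universe, to r_scc."""
    if not universe:
        return
    post = lambda a: {(v, c) for (u, c, v) in edges if (u, c) in a}
    pre = lambda a: {(u, c) for (u, c, v) in edges if (v, c) in a}
    f = pivots(universe)
    b, f_front, b_front = set(f), set(f), set(f)
    f_paused, b_paused, f_lock, b_lock = set(), set(), set(), set()
    all_colours = colours(universe)
    # First cycle: lock-step reachability until every colour is locked.
    while (f_lock | b_lock) != all_colours:
        f_new = (post(f_front) & universe) - f
        b_new = (pre(b_front) & universe) - b
        # Sketch-specific assumption: already locked colours are excluded.
        f_lock |= (all_colours - colours(f_new)) - b_lock
        b_lock |= (all_colours - colours(b_new)) - f_lock
        f_paused |= restrict(f_new, b_lock)  # forward frontier left unfinished
        b_paused |= restrict(b_new, f_lock)  # backward frontier left unfinished
        f_front = f_new - restrict(f_new, b_lock)
        b_front = b_new - restrict(b_new, f_lock)
        f |= f_new
        b |= b_new
    con = restrict(f, f_lock) | restrict(b, b_lock)
    # Second cycle: finish the paused searches inside the converged set.
    f_front, b_front = f_paused & con, b_paused & con
    while f_front or b_front:
        f_front = (post(f_front) & con) - f
        b_front = (pre(b_front) & con) - b
        f |= f_front
        b |= b_front
    w = f & b
    r_scc |= {(u, c, v) for (u, c) in w for (v, d) in w if c == d}  # JOIN(W)
    decomposition(edges, universe - con, r_scc)
    decomposition(edges, con - w, r_scc)

# Hypothetical graph: colour "r" has the cycle 0 <-> 1, colour "g" the cycle 1 <-> 2.
example_edges = {(0, "r", 1), (1, "r", 0), (1, "r", 2),
                 (0, "g", 1), (1, "g", 2), (2, "g", 1)}
relation = set()
decomposition(example_edges, {(v, c) for v in range(3) for c in "rg"}, relation)
```

In the example, a single run reports the components {0, 1} and {2} for colour "r" and {0} and {1, 2} for colour "g", even though the component structure differs between the two colours.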

Let us note that there is possibly another approach. Instead of trying to work 
with all colours still appearing in the coloured vertex set at once, we could fork
a new recursive procedure whenever the colour set splits due to the differences in
the graph structure. For example, instead of picking multiple coloured vertices
as pivots, one could pick a single vertex with a valid subset of colours and then
address the remaining colours in a separate recursive call. While such an approach
could be to some extent beneficial in a massively parallel environment where 
each recursive call can be executed independently on a new CPU, the amount 
of forking in large systems will soon become unreasonable. More importantly, 
it defeats the purpose of symbolic representation which aims to minimise the 
number of symbolic operations. 


3.5 Correctness and Complexity of the Coloured Lock-step 
Algorithm 


Theorem 1. Let 𝒢 = (V, C, E) be a coloured graph. The coloured lock-step
algorithm terminates and computes the coloured SCC decomposition relation R_scc.


Proof. We first show that the set W computed on line 31 indeed contains one SCC
for every colour c ∈ COLOURS(𝒱) and that the recursive calls of DECOMPOSITION
preserve the property that 𝒱 is SCC-closed with respect to all colours.

Let us assume that 𝒱 is SCC-closed and let us take an arbitrary c ∈
COLOURS(𝒱). The function PIVOTS chooses a set that contains exactly one
pair whose colour is c, let us call this pair (v, c). Let us further assume that c is
assigned into F_lock first (the case with B_lock is completely symmetric).

Let us now choose an arbitrary vertex w such that v and w are in the same
SCC of 𝒢(c), i.e. v →* w and w →* v. As the first while cycle finishes, F contains
all the pairs of the form (u, c) ∈ 𝒱 where u is reachable from v in 𝒢(c). Thus, it
also contains (w, c) due to the fact that 𝒱 is SCC-closed. Now, either (w, c) ∈ B,
or there exists a vertex x such that w →* x, x →* v in 𝒢(c) and (x, c) ∈ B_p. This
means that (w, c) is added to B in the second while cycle. In both cases, both
(v, c) and (w, c) are then added to W. As the vertex choices were arbitrary, this
proves that the SCC of v in 𝒢(c) is contained in W. Furthermore, if (y, c) ∈ W
for an arbitrary y, then v →* y and y →* v in 𝒢(c), which means that y is in
SCC(𝒢(c), v). This proves that W contains exactly one SCC for every colour
c ∈ COLOURS(𝒱).

We now argue that Con is SCC-closed with respect to all colours. This
immediately implies that both 𝒱 \ Con and Con \ W are SCC-closed. Let us
assume that there is a colour c ∈ COLOURS(𝒱) and two vertices v, w in the
same SCC of 𝒢(c) such that (v, c) ∈ Con, but (w, c) ∉ Con. Let us assume that
c ∈ F_lock (as above, the case of B_lock is completely symmetrical). Then (v, c) ∈ F
after the first while cycle finishes. This also means that (w, c) ∈ F as the forward
reachability procedure is completed for c and thus (w, c) ∈ Con, a contradiction.

What remains is to show that the algorithm terminates and that every SCC
is eventually found. Termination is trivially proved by the fact that the size of the
set 𝒱 always decreases in recursive calls: both W and Con are nonempty, because
they contain the initial pivot set as a subset. Clearly, a representative of every
SCC of every monochromatisation 𝒢(c) is eventually chosen as a pivot. Together
with the above reasoning, this implies that the algorithm is correct.


Theorem 2. Let |V| be the number of vertices in the coloured graph and let 
|C| be the number of colours. The coloured lock-step algorithm performs at most 
O(|C|·|V|·log |V|) symbolic steps.


Proof. Let us first note that all the derived operations defined in Fig. 1 use 
only a constant number of the basic symbolic operations. As we are considering 
asymptotic complexity here, we can view all the operations in Fig. 1 as elementary 
symbolic steps. 

We first make the observation that each vertex may be chosen as a part of
the pivot set at most |C| times. Clearly, once a vertex v is included in the pivot
set with a set of colours C′, then {v} × C′ ⊆ Con (due to the monotonicity of
the construction of F and B) and the elements of {v} × C′ do not appear in
subsequent recursive calls. This means that the total complexity of the calls to
PIVOTS is bounded by O(|C|·|V|) and we can exclude the calls from the rest of
the complexity analysis.

We now consider the complexity of a single call to DECOMPOSITION without
the subsequent recursive calls. Let us now select one of the colours for which
the lock-step reachability procedure (lines 10-21) finished last, i.e., one of the
colours that have been added to F_lock or B_lock in the final iteration of the cycle.
Let us call this colour c. Recall that σ_2(c, 𝒳) is the set of vertices with colour c
in a coloured set 𝒳.

Let us denote W_c := σ_2(c, W) and let S be the smaller of σ_2(c, 𝒱 \ Con)
and σ_2(c, Con \ W). Clearly S contains at most |V|/2 vertices. Let k = |S ∪ W_c|.
We now argue that the number of symbolic steps in a given call (without the
recursive calls) is bounded by O(k).

Assume w.l.o.g. that c ∈ F_lock (a completely symmetric argument solves the
case c ∈ B_lock). Then σ_2(c, Con) = σ_2(c, F). If S is σ_2(c, Con \ W) then k is the
size of σ_2(c, F). Each iteration of the first while cycle puts at least one vertex
with colour c into F; otherwise c would not be one of the last colours to finish.
This means that the cycle runs for at most k iterations. This also means that
the size of σ_2(x, 𝒳) for all colours x and 𝒳 ∈ {F, B} is also bounded by k, which
in turn means that the second while cycle cannot make more than O(k) steps.

If S is σ_2(c, 𝒱 \ Con) instead, let us define B_c := σ_2(c, B) right after the first
while cycle has finished. We know that B_c ⊆ S ∪ W_c: if a vertex v were in B_c \ S
then (v, c) ∈ Con, hence (v, c) ∈ F, and thus v ∈ W_c. Again, each iteration of the first while
cycle puts at least one vertex with colour c into B; otherwise c would have been
in B_lock before it appeared in F_lock. Similarly to the previous case, this means
that both while cycles run for at most O(k) steps.
The rest of the argument uses amortised reasoning, in a way similar to the
proof in [14]. Note that each vertex is going to be an element of the set W as
described above at most |C| times (once for each colour). Furthermore, each
vertex is going to be an element of the set S as described above at most |C|·log |V|
times: for each colour, the vertex can be an element of the smaller of the two
sets at most log |V| times. As the cost of each single call can be charged to the
vertices in S ∪ W as explained above, it is sufficient to charge each vertex the
total cost of |C| + |C|·log |V|. Together, this means that the total number of
symbolic steps is bounded by O(|C|·|V|·log |V|).


Note that the upper bound established by Theorem 2 is no better than the one 
we would get if we split the coloured graph into its monochromatic constituents 
and processed each monochromatic graph separately using the original lock-step 
algorithm [14]. We remark, however, that the coloured approach is a heuristic 
whose real complexity might be much smaller. Indeed, the complexity analysis 
in the previous proof focused on a single colour, omitting the fact that SCCs
for many other colours are found at the same time. In cases where the edges are
largely shared among the colours, which is true in many applications, the heuristic
has the potential to significantly outperform the parameter-scan approach. The
situation is similar to that of coloured model checking; see the observations
made in [5].


4 Experimental Evaluation 


In this section, we examine the applicability of our algorithm in real-world sit- 
uations. First, we discuss how we implemented the algorithm and share some 
useful recommendations in this regard. We then look at how the implementa- 
tion performs on real-life coloured graphs which are derived from large models 
considered in computational biology. 


4.1 Implementation 


As our symbolic set representation, we consider standard reduced ordered binary 
decision diagrams (ROBDDs, or just BDDs for short) [17]. The source of our 
edge-coloured graphs are the transition systems of parametrised Boolean networks 
(PBN) as understood in [11, 60]. 

Boolean networks. Normal (non-parametrised) Boolean networks [34, 46, 
49,54] appear in computational systems biology as logical models of complex bio- 
chemical processes [16]. Here, we use the asynchronous variant of BNs introduced 
by Thomas [54]. A Boolean network consists of Boolean variables, each having a 
Boolean update function. Update functions are executed non-deterministically 
and change the state of the Boolean variables. The semantics of such a network 
is a directed graph where the vertices are the possible valuations of the Boolean 
variables and the edges are induced by the non-deterministic execution of the 
update functions. 
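The graph semantics described above can be illustrated with a short sketch that enumerates the edges of a small network explicitly. It assumes the usual reading of the asynchronous semantics in which a single variable is updated per transition; the function name async_transitions and the two-variable example network are hypothetical.

```python
from itertools import product

def async_transitions(update_functions):
    """Edges of the asynchronous state-transition graph of a Boolean network.

    update_functions[i] maps a state (a tuple of bools) to the next value of
    variable i. Each edge changes the value of exactly one variable.
    """
    n = len(update_functions)
    edges = set()
    for state in product((False, True), repeat=n):
        for i, update in enumerate(update_functions):
            new_value = update(state)
            if new_value != state[i]:
                successor = state[:i] + (new_value,) + state[i + 1:]
                edges.add((state, successor))
    return edges

# Hypothetical network with two variables: x0' = x1 and x1' = not x0.
network = [lambda s: s[1], lambda s: not s[0]]
transitions = async_transitions(network)
```

For this example the four states form a single cycle, so the whole state space is one strongly connected component. Explicit enumeration like this is only feasible for very small networks, which is why a symbolic representation is used in practice.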
This type of model is especially challenging for symbolic analysis. It is
well known that using symbolic structures, like BDDs, to represent very
large state spaces gives good results for synchronous systems, but shows its limits
when trying to tackle asynchronicity (see e.g. [23]).

In the parametrised variant, the update functions can be partially unknown. 
This introduces a set of colours (parametrisations), each colour fully instantiating 
all update functions of the network. As a result, the semantics of such a model is 
an edge-coloured directed graph as we consider in this paper. For a full technical 
description of PBNs and their coloured graph semantics, please refer to [11]. 

Our implementation heavily relies on the existing internal libraries of our 
tool AEON [12], which at the moment partially supports symbolic analysis of 
PBNs. Specifically, AEON uses symbolic BDD-based representation of colour 
sets, but relies on explicit state space exploration. In this work, we extend these 
capabilities to fully symbolic analysis of the whole graph. 

Custom operations. Aside from implementing the POST and PRE opera- 
tions for a given PBN, we also choose to provide specialised implementations for 
the COLOURS and PIVOTS procedures. Especially for the PIVOTS procedure, this 
can greatly reduce the number of necessary symbolic steps, as we avoid picking 
pivots vertex-by-vertex. 

To implement these two operations as efficiently as possible, we always order 
the Boolean variables in our BDDs starting from the colour and ending with vertex 
variables. This ensures that both PIVOTS and COLOURS can be implemented by
pruning the vertex variable nodes and minimising the BDD. 

Specifically, in this ordering, for COLOURS, all vertex nodes are effectively 
substituted with the true terminal node and the BDD is minimised. For PIVOTS, 
one (arbitrary) path of vertex variable nodes (corresponding to one pivot vertex) 
is preserved for every colour, and the rest of the vertex nodes are pruned. 

Trimming. Finally, most graphs typically contain a large number of trivial 
SCCs that introduce unnecessary overhead to the main algorithm. To avoid this 
overhead, we additionally perform a trimming step before each invocation of 
DECOMPOSITION. Trimming consists of repeatedly removing all vertices which 
have no outgoing or no incoming edges and is employed by most symbolic SCC 
algorithms on standard directed graphs as well. The coloured analogue of trimming 
is straightforward, as it can be achieved using PRE and POST operations just as in 
the non-coloured case. For a coloured set of vertices 𝒱, POST(PRE(𝒢, 𝒱) ∩ 𝒱) ∩ 𝒱
returns only vertices which have at least one predecessor in 𝒱. The successor
variant simply exchanges the POST and PRE operations. 
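The trimming step can be sketched on explicit sets as follows, using the same PRE/POST combination as in the text. The encoding (edges as triples (u, c, v), coloured sets as sets of pairs (v, c)) and the function name trim are assumptions made for the illustration.

```python
def trim(edges, universe):
    """Repeatedly remove pairs (v, c) with no predecessor or no successor in universe."""
    post = lambda a: {(v, c) for (u, c, v) in edges if (u, c) in a}
    pre = lambda a: {(u, c) for (u, c, v) in edges if (v, c) in a}
    current = set(universe)
    while True:
        # Pairs with at least one predecessor inside the current set.
        has_pred = post(pre(current) & current) & current
        # Pairs with at least one successor inside the current set.
        has_succ = pre(post(current) & current) & current
        trimmed = has_pred & has_succ
        if trimmed == current:
            return current
        current = trimmed

# Hypothetical graph with one colour "r": 0 -> 1 <-> 2 -> 3.
example_edges = {(0, "r", 1), (1, "r", 2), (2, "r", 1), (2, "r", 3)}
kept = trim(example_edges, {(v, "r") for v in range(4)})
```

In the example, the source vertex 0 and the sink vertex 3 are removed and only the cycle between 1 and 2 remains for the main procedure.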


4.2 Experiments 


We evaluated our algorithm on 7 real-world networks based on the models from
the GINsim Boolean network database [19]. The experiments were performed
on a 32-core AMD Ryzen workstation with 64 GB of RAM. All tested
models are available in our source code repository.³ Note that the smaller models


³ https://github.com/sybila/biodivine-lib-param-bn/tree/tacas
Table 1. Overview of the test models for the algorithm evaluation. The times
(minutes:seconds) refer to the total runtime of the SCC decomposition procedure. The
model variables and parameters give the number of Boolean variables necessary to
represent the PBN symbolically. Finally, the graph size and colour set size specify the
magnitude of |V|·|C| and |C| for the coloured graph corresponding to the network.


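Model name         Variables     Parameters   Graph size    Colour set size   Time
[illegible]        [illegible]   48           [illegible]   [illegible]       00:09.47
[illegible]        19            46           [illegible]   [illegible]       00:58.35
[illegible]        [illegible]   54           [illegible]   [illegible]       01:13.39
[illegible]        18            44           [illegible]   [illegible]       50:44.80
[illegible] [41]   23            48           [illegible]   [illegible]       71:80.12
[illegible]        26            38           [illegible]   [illegible]       78:38.34
[illegible] [36]   [illegible]   48           [illegible]   [illegible]       118:34.88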


(< 2^30) should be easy to process even on a less powerful machine; however, the
larger models can require substantial amounts of RAM.

The PBNs and their analysis runtimes are summarised in Table 1. For each
network, we specify the number of Boolean variables used by symbolic encoding, 
separated into model variables (vertices) and model parameters (colours), and 
the actual approximate size of the coloured graph. Note that not all combinations 
of parameters (possible graph colours) are usually biologically admissible, and 
these are filtered out before the coloured SCC decomposition. Hence the size of 
the graph is smaller than the space of all the considered BDD variables. 

From the presented results, we can draw the following observations: First,
the fully symbolic approach allows us to scale to much larger graphs than before,
especially in terms of state space. Until now, AEON was typically limited (even
for an easier problem of bottom SCC detection) to vertex counts of 2^15 to 2^20,
exhausting memory even for much smaller state spaces when dealing with complex
parameter space. Here, we can easily handle up to 2^30 vertices with non-trivial
parameter space and we hope to push this number even higher with further
optimisations to our experimental implementation.

Second, the coloured heuristic is beneficial for symbolic computation. To 
support this claim, we considered a monochromatic variant of the decomposition 
problem for the WG Signaling Pathway and tested the basic lock-step algorithm 
on a collection of pseudo-random monochromatisations of this graph. Processing 
one such monochromatisation typically required 0.5 to 1 second. Considering the
graph in question has 2359296 colours, processing the colours one-by-one would,
even in ideal conditions, take well above 300 hours (more than 12 days).
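The estimate can be checked with a few lines of arithmetic. The sketch below only multiplies the colour count by the per-colour times quoted above; the variable names are chosen for the illustration.

```python
colour_count = 2_359_296          # colours of the graph in question
seconds_per_colour = (0.5, 1.0)   # quoted range for one monochromatisation

# Total sequential runtime in hours for the optimistic and pessimistic case.
low_hours = colour_count * seconds_per_colour[0] / 3600
high_hours = colour_count * seconds_per_colour[1] / 3600
low_days = low_hours / 24

print(f"{low_hours:.1f} to {high_hours:.1f} hours, at least {low_days:.1f} days")
```

Even the optimistic bound of half a second per colour gives roughly 328 hours, which is consistent with the statement above.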


5 Conclusions 


In this paper we have presented a fully symbolic algorithm for detecting all 
monochromatic strongly connected components in edge-coloured graphs. The 
work has been motivated by systems sciences, namely systems biology, where the 
need for efficient automated analysis of components in large graphs with large 
sets of coloured edges is emergent. The algorithm combines several ideas inspired 
by existing state-of-the-art algorithms for SCC decomposition in a non-trivial 
way. We believe this is the first fully symbolic algorithm aiming to solve the 
problem efficiently. 

The experimental evaluation has shown that in expected practical scenar- 
ios, the presented algorithm has a strong potential to be significantly faster 
than iterating a standard algorithm for SCC decomposition executed on all 
monochromatic sub-graphs one-by-one. 
Abstract. NP-hard combinatorial optimization problems are pivotal in science
and business. There exists a variety of approaches for solving such problems, but 
for problems with complex constraints and objective functions, local search algo- 
rithms scale the best. Such algorithms usually assume that finding a non-optimal 
solution with no other requirements is easy. However, what if it is NP-hard? In 
such case, a SAT solver can be used for finding the initial solution, but how can 
one continue solving the optimization problem? We offer a generic methodol- 
ogy, called Local Search with SAT Oracle (LSSO), to solve such problems. LSSO 
facilitates implementation of advanced local search methods, such as variable 
neighbourhood search, hill climbing and iterated local search, while using a SAT 
solver as an oracle. We have successfully applied our approach to solve a critical 
industrial problem of cell placement and productized our solution at Intel. 


1 Introduction 


Real-life combinatorial optimization problems are pivotal in science, operations re- 
search, engineering, economics, and business [11, 13, 20,21, 23]. 

Loosely speaking, an instance of a combinatorial optimization problem deals with 
the minimization of an objective function over a finite set, subject to feasibility con- 
straints (or, simply, constraints). The set of all elements satisfying the constraints is 
referred to as the set of feasible solutions (or, simply, solutions). In this paper, we focus
on solving any problem that can be expressed as a constraint optimization program
(COP) [2]. Arguably, the vast majority of combinatorial problems encountered in
practice fall under this category.

Many important combinatorial problems are NP-hard. For such problems, various 
algorithmic strategies have been devised, including complete methods, such as branch- 
and-bound and dynamic programming, and incomplete methods, such as greedy algo- 
rithms and local search. Each such method imposes requirements on the mathematical 
properties of the problem with a consequent limit on the scope of applicability. 

Local search algorithms stand out from the rest in that they impose relatively mild 
constraints on the type of the problem to be addressed, thus providing a wide scope of 
applicability. Furthermore, they seem to scale better with input size relative to complete 
algorithms [24]. This makes local search algorithms an attractive choice. However, lo- 
cal search algorithms may return a low-quality solution or no solution at all, given a 
problem for which the mere task of finding a feasible solution is NP-hard. Henceforth, 
we shall refer to such problems as NP-Hard-Feasible problems. 
This paper introduces the Local Search with SAT Oracle (LSSO) methodology, that
is, local search algorithms which use a SAT solver (or a SAT-based optimization algo- 
rithm; details appear later) as an oracle. A key advantage of our approach is that it can 
handle problems with complex constraints and objective functions. In particular, it can 
handle NP-Hard-Feasible problems. 

To see how SAT solvers might be useful, consider the basic version of a local search 
for an optimal solution. At the beginning, the local search generates an initial solution 
and sets it as the current solution. Then, it enters a loop. In each iteration, it looks 
for a solution with a lower value of the objective function within a neighbourhood of 
the current one. If such a solution is found, it is set to be the current solution, and the 
execution resumes. Otherwise, the algorithm terminates and returns the current solution. 
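The basic loop described above can be sketched as follows. This is a generic illustration and not the LSSO method itself: the function name local_search, the callback parameters and the choice of moving to the best improving neighbour are assumptions made for the example.

```python
def local_search(initial, neighbours, cost):
    """Basic local search: move to an improving neighbour until none exists."""
    current = initial
    while True:
        # Candidates in the neighbourhood with a strictly lower objective value.
        better = [s for s in neighbours(current) if cost(s) < cost(current)]
        if not better:
            return current  # no improving neighbour: a local optimum
        # Assumed policy: take the best improving neighbour.
        current = min(better, key=cost)

# Hypothetical toy problem: minimise (x - 3)^2 over the integers,
# where the neighbourhood of x is {x - 1, x + 1}.
result = local_search(
    10,
    neighbours=lambda x: [x - 1, x + 1],
    cost=lambda x: (x - 3) ** 2,
)
```

In this toy problem every solution is feasible and the neighbourhood is tiny, so the search is trivial; the difficulty discussed in the text arises when generating the initial solution or a feasible neighbour is itself hard.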

A key component of local search algorithms is the neighbourhood function, which 
assigns to each feasible solution a subset of feasible solutions, called its neighbour- 
hood. Ordinarily, a neighbourhood of the current feasible solution comprises a set of 
solutions which can be obtained from the current solution by applying a small collec- 
tion of feasibility-preserving perturbations to its combinatorial structure. A key con- 
cern is ensuring that neighbourhoods: (i) are polynomially searchable, and (ii) con- 
tain high-quality solutions. However, meeting both requirements might be challenging, 
since polynomial searchability implies that neighbourhoods should be small, and hence 
less likely to contain high-quality solutions. In addition, in the case of NP-Hard-Feasible 
problems, it is not clear how to achieve polynomial searchability, since a search should, 
in particular, be able to find a feasible solution, which is NP-hard. 

Our main idea is to let the SAT solver both find an initial solution and conduct the 
neighbourhood search. The designer can now define feasibility constraints and neigh- 
bourhoods declaratively, that is, by a set of SAT constraints. The designer has more 
freedom to choose neighbourhoods, which need neither be small, nor contain only so- 
lutions close to the current solution. This is because the search of the now complex and 
possibly large neighbourhoods is entrusted to SAT solvers, constructed precisely to ef- 
ficiently search large complex subspaces. Our approach lends itself to implementations 
of advanced local search variants, such as variable neighbourhood search, hill climbing 
and iterated local search [29]. 

An important feature of our algorithms is that they are anytime. Recall that an any- 
time algorithm is expected to return a valid solution even if interrupted. An anytime 
algorithm for an optimization problem is expected to find an improving set of solutions. 
The anytime property is essential for industrial application, since it allows the user to 
get an approximate solution even for very difficult instances [14, 15]. 

We demonstrate the usefulness of our approach by solving hard industrial instances 
of the NP-Hard-Feasible cell placement problem. Cell placement is one of the most 
important problems in VLSI automation [28]. Its most basic version concerns placing 
without overlap a set of rectangles on a grid, while minimizing the occupied area. In 
reality, the problem is more complex. Our approach has been successfully productized 
at Intel. 

The rest of this paper is organized as follows: Sect. 2 provides the necessary back- 
ground. Sect. 3 introduces our LSSO methodology. Sect. 4 shows how to solve place- 
ment with LSSO. Sect. 5 presents the experimental results. Sect. 6 concludes our paper. 
2 Background 


This section provides some background. Sect. 2.1 is an overview of COP. Sect. 2.2 
describes the cell placement problem and shows how to reduce it to COP. Sect. 2.3 dis- 
cusses how one can solve a COP using a SAT-based bit-vector solver. Sect. 2.4 reviews 
local search. 


2.1 Constraint Optimization Program (COP) 


This work presents a new methodology for solving a wide class of combinatorial op- 
timization problems, which can be expressed as a Constraint Optimization Program, 
shown in Def. 1. 


Definition 1 (Constraint Optimization Program (COP) [2]). A constraint optimization
program is a tuple $(X, D, C, \Psi)$ where:
1. $X = \{x_1, \ldots, x_n\}$ is a finite set of variables, often referred to as decision variables.
2. $D = \{D_1, \ldots, D_n\}$ is a corresponding set of finite domains. Without loss of generality,
each $D_i$ is assumed to be a closed bounded interval of non-negative integers.
3. $C = \{C_1, \ldots, C_m\}$ is a finite set of constraints $C_k : D_1 \times \cdots \times D_n \to \{0, 1\}$.
4. $\Psi : D_1 \times \cdots \times D_n \to \mathbb{Z}$ is an objective function to be minimized.
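
To make Def. 1 concrete, the following minimal Python sketch represents a COP as plain data and solves a toy instance by exhaustive enumeration. The class name `COP`, its field names, and the toy instance are illustrative assumptions, not part of the paper; exhaustive enumeration is only a reference for tiny domains and is not the solving method proposed here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Assignment = Tuple[int, ...]


@dataclass
class COP:
    """A constraint optimization program in the sense of Def. 1 (illustrative)."""
    domains: Sequence[range]                             # D_1, ..., D_n
    constraints: Sequence[Callable[[Assignment], bool]]  # C_1, ..., C_m
    objective: Callable[[Assignment], int]               # to be minimized

    def feasible(self, assignment: Assignment) -> bool:
        return all(c(assignment) for c in self.constraints)

    def brute_force_minimum(self) -> Optional[Assignment]:
        """Exhaustive reference solver; viable only for tiny domains."""
        candidates = [a for a in product(*self.domains) if self.feasible(a)]
        return min(candidates, key=self.objective) if candidates else None


# Hypothetical toy instance: x0, x1 in [0, 3], x0 + x1 >= 3, x0 != x1,
# minimize 2*x0 + x1.
cop = COP(
    domains=[range(4), range(4)],
    constraints=[lambda a: a[0] + a[1] >= 3, lambda a: a[0] != a[1]],
    objective=lambda a: 2 * a[0] + a[1],
)
best = cop.brute_force_minimum()
```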


2.2 The Cell Placement Problem 


Cell Placement (Placement) is a major stage in the VLSI design cycle [8, 16]. The input 
of the cell placement problem comprises the following components: 


1. A rectangular grid region of $M$ rows and $N$ columns, on which the cells are to be
placed. Row/column line numbering starts at 0 and ends at $M$/$N$, respectively.

2. A finite set $C$ of rectangular cells. The width and the height of each cell $c \in C$
are assumed to be positive integers, denoted by $c^{width}$ : $0 < c^{width} \le N$ and
$c^{height}$ : $0 < c^{height} \le M$, respectively.

3. A set $R$ of forbidden rectangular regions. A forbidden region $r \in R$ is specified
by 4 numbers $r^{west}$, $r^{south}$, $r^{east}$ and $r^{north}$ (where $0 \le r^{west}, r^{east} \le N$;
$0 \le r^{south}, r^{north} \le M$; $r^{east} > r^{west}$; $r^{north} > r^{south}$), denoting the leftmost
column line, bottom row line, rightmost column line, and top row line, respectively.

4. A finite set $I$ of nets, each consisting of a non-empty subset of cells. The nets may
(and usually do) intersect.


We are interested in feasible placements, that is, placements in which no cell over- 
laps other cells or forbidden regions. Given a feasible placement, we define the size of 
a net $n \in I$ as the perimeter of the box bounding its placed cells. We define the size 
of the placement as the sum of the sizes of the nets. We are required to find a feasible 
placement of a minimal size. An example is shown in Fig. 1. 
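
The size definitions above can be sketched in a few lines of Python. This follows the prose definition (the perimeter of the bounding box); the `Cell` type, the function names, and the three example cells are assumptions made for illustration and do not reproduce the placement of Fig. 1.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    west: int    # leftmost column line
    south: int   # bottom row line
    width: int
    height: int

    @property
    def east(self) -> int:
        return self.west + self.width

    @property
    def north(self) -> int:
        return self.south + self.height


def net_size(cells):
    """Perimeter of the box bounding the placed cells of one net."""
    box_width = max(c.east for c in cells) - min(c.west for c in cells)
    box_height = max(c.north for c in cells) - min(c.south for c in cells)
    return 2 * (box_width + box_height)


def placement_size(nets):
    """Sum of the sizes of all nets."""
    return sum(net_size(net) for net in nets)


# Hypothetical placement of three non-overlapping cells and two nets.
a = Cell(west=0, south=0, width=4, height=1)
b = Cell(west=0, south=1, width=2, height=2)
c = Cell(west=4, south=0, width=1, height=5)
total = placement_size([[a, b], [b, c]])
```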

In industrial practice, there may be additional requirements, such as aligning
some of the cells, enforcing parity constraints (i.e., the user might require the y
coordinates of some of the cells to be either even or odd) [19], and ensuring a minimal
distance between some of the cells and others. We omit further details due to IP considerations.

Placement is NP-Hard-Feasible, since the NP-complete bin packing problem can be 
reduced to the decision version of the placement problem [10]. 
2.2.1 Constraint Optimization Program for Cell Placement. We show how to construct
a COP for the cell placement problem. For each cell $c \in C$, let $c^{west}$ and $c^{east}$
denote its leftmost and rightmost column respectively, and $c^{south}$ and $c^{north}$ denote its
bottom and top row. Strictly speaking, it suffices to use $c^{west}$ and $c^{south}$ as the COP's
independent variables, but it is convenient to use $c^{east}$ and $c^{north}$ as syntactic sugar for
$c^{west} + c^{width}$ and $c^{south} + c^{height}$ respectively. The COP looks as follows:


1. Variables: $\{c^{west}, c^{south} \mid c \in C\}$
2. Domains: $c^{west} \in [0 \ldots N-1]$ and $c^{south} \in [0 \ldots M-1]$
3. Feasibility constraints:

(a) Each cell $c$ is placed wholly within the grid region:

$$(c^{west} \ge 0) \wedge (c^{east} \le N) \wedge (c^{south} \ge 0) \wedge (c^{north} \le M)$$

(b) For every pair of cells $(c_i, c_j)$, such that $i < j$, there is no overlap:

$$(c_i^{west} \ge c_j^{east}) \vee (c_j^{west} \ge c_i^{east}) \vee (c_i^{south} \ge c_j^{north}) \vee (c_j^{south} \ge c_i^{north})$$

(c) For every pair $(r, c)$ of a forbidden region $r$ and a cell $c$, there is no overlap:

$$(r^{west} \ge c^{east}) \vee (c^{west} \ge r^{east}) \vee (r^{south} \ge c^{north}) \vee (c^{south} \ge r^{north})$$

(d) Constraints representing any additional industrial requirements.
4. Objective function $\Psi$: for every net $n \in I$, let $\|n\|$ denote its size. We have:

$$\|n\| = \left(\max_{c \in n}(c^{east}) - \min_{c \in n}(c^{west})\right) + \left(\max_{c \in n}(c^{north}) - \min_{c \in n}(c^{south})\right)$$

$$\Psi = \sum_{n \in I} \|n\|$$
Fig. 1: Placement example [16]. A solution is shown for the problem of placing five cells $c_1$, $c_2$,
$c_3$, $c_4$ and $c_5$ of sizes 4 × 1, 4 × 3, 2 × 2, 2 × 4 and 1 × 5 respectively, on a grid with $M = N = 8$.
There are three nets: $n_1 = \{c_1, c_3, c_5\}$, $n_2 = \{c_2, c_3\}$ and $n_3 = \{c_2, c_4\}$ (without any forbidden
regions). The bounding boxes of the nets are $B_1$, $B_2$ and $B_3$, respectively. The sizes of the nets,
comprising the perimeters of the bounding boxes, are 20, 18 and 20, respectively. The overall
placement size is 20 + 18 + 20 = 58. The solution is an optimal one.
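
The feasibility constraints 3(a) to 3(c) of the placement COP can be checked directly on a candidate placement. The sketch below is a plain Python checker written under the assumption that every rectangle is given as a tuple `(west, south, east, north)` of grid lines; the helper names and the sample data are illustrative and are not taken from the paper or from Fig. 1.

```python
from itertools import combinations


def rect(west, south, width, height):
    """Rectangle as (west, south, east, north) grid lines."""
    return (west, south, west + width, south + height)


def disjoint(a, b):
    """Non-overlap disjunction used in constraints 3(b) and 3(c)."""
    a_west, a_south, a_east, a_north = a
    b_west, b_south, b_east, b_north = b
    return (a_west >= b_east or b_west >= a_east
            or a_south >= b_north or b_south >= a_north)


def is_feasible(cells, forbidden, rows, cols):
    # 3(a): every cell lies wholly within the grid region.
    inside = all(w >= 0 and e <= cols and s >= 0 and n <= rows
                 for (w, s, e, n) in cells)
    # 3(b): no two cells overlap.
    cells_ok = all(disjoint(a, b) for a, b in combinations(cells, 2))
    # 3(c): no cell overlaps a forbidden region.
    regions_ok = all(disjoint(r, c) for r in forbidden for c in cells)
    return inside and cells_ok and regions_ok


# Hypothetical 8 x 8 instance with one forbidden region on the right half.
forbidden = [rect(4, 0, 4, 8)]
good = [rect(0, 0, 4, 1), rect(0, 1, 2, 2)]
bad = [rect(3, 0, 4, 1), rect(0, 1, 2, 2)]   # first cell enters the region
```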
2.3 Solving COP with SAT 


A COP can be solved with various types of solvers [2]. In particular, it is possible 
to solve a COP by reduction to a series of SAT solver invocations through bit-vector 
reasoning as explained below. 


2.3.1 Bit-vector Solving and SAT. We start by reviewing the basic terminology
related to SAT solving. A literal $l$ is a Boolean variable $v$ or its negation $\neg v$. A clause
is a disjunction of literals. A formula F is in Conjunctive Normal Form (CNF) if it is a 
conjunction (set) of clauses. 

A SAT solver [4] receives a CNF formula $F$ and returns a satisfying assignment 
(aka, model or solution), if one exists. In incremental SAT solving under assumptions [5, 
18], the user may invoke the SAT solver multiple times, each time with a different 
set of assumption literals (called, simply, the assumptions) and, possibly, additional 
clauses. The solver then checks the satisfiability of all the clauses provided so far, while 
enforcing the values of the current assumptions. 
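
The difference between assertions and assumptions can be illustrated with a brute-force stand-in for an incremental solver. `ToySolver` and its methods `assert_` and `sat` are hypothetical names invented for this sketch; it enumerates small integer domains instead of reasoning over CNF, so it only mimics the interface described above.

```python
from itertools import product


class ToySolver:
    """Brute-force stand-in for an incremental solver (hypothetical API)."""

    def __init__(self, domains):
        self.domains = domains
        self.asserted = []            # constraints enforced in every later call

    def assert_(self, constraint):
        self.asserted.append(constraint)

    def sat(self, assumptions=()):
        """Return the first model satisfying the assertions and the current
        assumptions, or None. Assumptions hold for this call only."""
        for model in product(*self.domains):
            if all(c(model) for c in (*self.asserted, *assumptions)):
                return model
        return None


solver = ToySolver([range(4)])
solver.assert_(lambda m: m[0] >= 1)                    # permanent
under_assumption = solver.sat([lambda m: m[0] >= 3])   # temporary restriction
without_assumption = solver.sat()                      # assumption is gone
solver.assert_(lambda m: m[0] >= 2)                    # permanent from now on
after_assertion = solver.sat()
```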

A bit-vector variable (bit-vector) of width $n = |B|$, $B = \{v_n, v_{n-1}, \ldots, v_1\}$, is a
sequence of $n$ Boolean variables, called bits. Bit $v_1$ is the Least Significant Bit (LSB)
and $v_n$ is the Most Significant Bit (MSB). A Boolean constant is either $\bot$ (0) or $\top$ (1).
A bit-vector constant is a bit-vector (BV), each one of whose bits is substituted by a 
Boolean constant. A bit-vector term is either a bit-vector, a BV constant, or a result 
of applying an operator which returns a bit-vector (for example, BV addition, if-then- 
else, concatenation) over other terms and atoms. An atom is either a Boolean variable, 
a Boolean constant or a result of applying an operator, which returns a Boolean (for 
example, = or unsigned-less-than), over BV terms and atoms. A bit-vector formula 
(also known as a bit-vector constraint) is recursively defined to be either an atom, a 
negation of a bit-vector formula, or the result of applying the Boolean operator $\wedge$ or
the Boolean operator $\vee$ over two or more bit-vector formulas. See [3, 12] for a rigorous
description of the BV language. A BV solver decides the satisfiability of BV formulas. 

A BV formula F is satisfiable iff it has a model, that is, an assignment of BV and 
Boolean constants to their corresponding BV and Boolean variables, which satisfies F. 
In this paper, BV constants are interpreted as unsigned numbers, and BV comparison 
operators are interpreted as unsigned. For example, given a bit-vector $B = \{v_3, v_2, v_1\}$,
the formula $F = B < 2$ has two models $\mu_1 : \mu_1(B) = 0$ and $\mu_2 : \mu_2(B) = 1$.
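
The example above can be reproduced by enumerating all assignments to the three bits and interpreting each as an unsigned number. This is a plain enumeration written for illustration (the function names are assumptions), not a BV solver.

```python
from itertools import product


def unsigned_value(bits_msb_first):
    """Interpret a tuple of bits (MSB first) as an unsigned integer."""
    value = 0
    for bit in bits_msb_first:
        value = (value << 1) | bit
    return value


def models(width, predicate):
    """All assignments to a bit-vector of the given width whose unsigned
    value satisfies the predicate."""
    return [bits for bits in product((0, 1), repeat=width)
            if predicate(unsigned_value(bits))]


# Models of F = (B < 2) for a bit-vector B = {v3, v2, v1}.
found = models(3, lambda b: b < 2)
values = [unsigned_value(m) for m in found]
```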

All the algorithms presented in this work are assumed to use the so-called eager BV 
solver [6] which, following some preprocessing, translates the input BV formula to an 
equisatisfiable formula in CNF and solves it with a SAT solver. Thus, we will use the 
notions of BV solving and SAT solving interchangeably. We also assume the BV solver 
to have the same incremental API as a SAT solver. 

Since the variables in a COP have finite domains, both the variables and the con- 
straints of a COP can be easily expressed as BV variables and BV constraints. 

In particular, in the COP constructed for the cell placement problem in Sect. 2.2.1,
the variables and the constraints can be expressed as BV variables and constraints as
follows: For each cell $c$, we define four bit-vectors: $c^{west}$ and $c^{east}$ of width $\lceil \log N \rceil$
as well as $c^{south}$ and $c^{north}$ of width $\lceil \log M \rceil$. All the constraints in our COP involve
these bit-vectors and can be expressed in terms of operators and relations available in
the BV language [3]. Specifically, we implement min and max operators using a series
of if-then-else operators. In addition, for every operator, we zero-extend the widths of
the operands and the resulting bit-vector to prevent an overflow, whenever required.

Reducing the constraints of a COP to a BV formula and invoking a BV solver suffices
to find one, not necessarily optimal, solution. However, for solving the optimization
problem by reduction to BV, one needs an extension of BV solving to optimization.¹


2.3.2 Extending Bit-vector Solving to Optimization. One can extend bit-vector 
solving to the so-called Bit-Vector Optimization (OBV) [19] as follows: 

A model $\mu$ of a BV formula $F$ is T-minimal, for a given bit-vector $T$, iff $\mu(T) \le \nu(T)$
(where the comparison is unsigned) for every model $\nu$ of $F$. Given a BV formula
$F$ and a term $T = \{t_n, t_{n-1}, \ldots, t_1\}$ in $F$, where $T$ is called the optimization target
(or, simply, the target), Bit-Vector Optimization (OBV) is the problem of finding a
T-minimal model of $F$. The bits of the target $T$ are referred to as the target bits.

Translating our placement COP to OBV is straightforward. We have already shown 
how to translate the constraints. The optimization target is constructed in the same way
as the objective function $\Psi$ is constructed in the COP.

How can one solve OBV in practice? First, one can use the following simple any- 
time Linear Search algorithm, implemented on top of an incremental BV solver [16,27]: 


1: solver.Assert(F); μ := solver.Sat()      ▷ assert F and find the first solution
2: while μ is a solution do                 ▷ while there is still a solution
3:     solver.Assert(T < μ(T))              ▷ block all the solutions with cost ≥ μ(T)
4:     μ := solver.Sat()                    ▷ can we improve our solution?
5: return μ                                 ▷ μ is guaranteed to be T-minimal
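
A runnable Python sketch of Linear Search follows. It runs on a brute-force stand-in for an incremental BV solver (`ToySolver`, a hypothetical class invented for this illustration) and keeps the last model found in a separate variable, so that the value returned after the final unsatisfiable call is the T-minimal model. The toy formula and target are assumptions.

```python
from itertools import product


class ToySolver:
    """Brute-force stand-in for an incremental BV solver (hypothetical API)."""

    def __init__(self, domains):
        self.domains = domains
        self.asserted = []

    def assert_(self, constraint):
        self.asserted.append(constraint)

    def sat(self, assumptions=()):
        for model in product(*self.domains):
            if all(c(model) for c in (*self.asserted, *assumptions)):
                return model
        return None


def linear_search(solver, formula, target):
    """Anytime minimization: `best` is a valid solution after every iteration."""
    solver.assert_(formula)
    best = None
    model = solver.sat()
    while model is not None:
        best = model
        bound = target(model)
        # Block all solutions whose cost is not strictly below the current one.
        solver.assert_(lambda m, bound=bound: target(m) < bound)
        model = solver.sat()
    return best


# Toy instance: x, y in [0, 3], x + y >= 3, minimize x + 2*y.
result = linear_search(ToySolver([range(4), range(4)]),
                       lambda m: m[0] + m[1] >= 3,
                       lambda m: m[0] + 2 * m[1])
```

On this instance the sketch visits models of cost 6, 5, 4 and 3 before the solver reports unsatisfiability.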


Another anytime algorithm to solve OBV is the following binary search-based al- 
gorithm, called OBV—BS [9, 19]: 


1: solver.Assert(F); μ := solver.Sat()      ▷ assert F and find the first solution
2: i := n                                   ▷ i is the current bit number, initialized to the MSB
3: while i ≥ 1 and μ(t_i) = ⊥ do            ▷ fix to ⊥ the MSBs, assigned to ⊥ in μ
4:     solver.Assert(¬t_i)
5:     i := i - 1                           ▷ after the loop, i will point to the first target bit, assigned ⊤
6: while i ≥ 1 do                           ▷ check one-by-one, if we can flip the remaining target bits to ⊥
7:     μ := solver.Sat({¬t_i})              ▷ run the solver under the assumption ¬t_i
8:     if satisfiable then
9:         while (i ≥ 1 and μ(t_i) = ⊥) do solver.Assert(¬t_i); i := i - 1 endwhile
10:    else
11:        solver.Assert(t_i); i := i - 1   ▷ t_i cannot be flipped to ⊥, so we fix it to ⊤
12: return μ
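
The bit-by-bit structure of OBV-BS can be sketched in Python as follows. The sketch again uses a hypothetical brute-force `ToySolver` in place of a real incremental BV solver, models the target bits $t_i$ as bits of an integer-valued `target` function, and stores the last satisfying model in `best` so that an unsatisfiable call does not overwrite it. All names and the toy instance are assumptions.

```python
from itertools import product


class ToySolver:
    """Brute-force stand-in for an incremental BV solver (hypothetical API)."""

    def __init__(self, domains):
        self.domains = domains
        self.asserted = []

    def assert_(self, constraint):
        self.asserted.append(constraint)

    def sat(self, assumptions=()):
        for model in product(*self.domains):
            if all(c(model) for c in (*self.asserted, *assumptions)):
                return model
        return None


def obv_bs(solver, formula, target, width):
    def bit(model, i):                 # value of target bit t_i, t_1 = LSB
        return (target(model) >> (i - 1)) & 1

    def is_zero(i):
        return lambda m: bit(m, i) == 0

    def is_one(i):
        return lambda m: bit(m, i) == 1

    solver.assert_(formula)
    best = solver.sat()
    if best is None:
        return None
    i = width
    while i >= 1 and bit(best, i) == 0:        # fix leading zero bits
        solver.assert_(is_zero(i))
        i -= 1
    while i >= 1:
        candidate = solver.sat([is_zero(i)])   # try to flip t_i to 0
        if candidate is not None:
            best = candidate
            while i >= 1 and bit(best, i) == 0:
                solver.assert_(is_zero(i))
                i -= 1
        else:
            solver.assert_(is_one(i))          # t_i must stay 1
            i -= 1
    return best


# Toy instance: x, y in [0, 3], x + y >= 3, minimize the 4-bit target x + 2*y.
result = obv_bs(ToySolver([range(4), range(4)]),
                lambda m: m[0] + m[1] >= 3,
                lambda m: m[0] + 2 * m[1], width=4)
```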


We have successfully applied OBV-BS to the problem of fixing an existing
placement [19], which is closely related to the generic placement problem we are exploring
in this work. However, both Linear Search and OBV-BS failed to scale to industrial
instances of our current problem of finding an optimal placement from scratch (with
Linear Search scaling somewhat better than OBV-BS).


¹ One cannot use MaxSAT [26], the widely used extension of SAT to optimizing a linear
Pseudo-Boolean (PB) function, to solve COP in the generic case, since the objective function
is not guaranteed to be linear PB. In particular, it is not linear PB for placement, if only
because the variables are bit-vectors, rather than Booleans.

Recently, we have introduced the so-called Polosat anytime algorithm [16], 
which can be used instead of the standard SAT solver inside Linear Search (and other 
SAT-based anytime optimization algorithms) to make it substantially more scalable. The 
idea behind Polosat, shown below, is to simulate local search using a SAT solver. We 
use the strictly-monotone version of Polosat [16], which assumes the availability of 
the so-called Boolean observable variables (observables) Obs, that is, a set of Boolean 
variables on which the objective function depends (for placement, the observables might 
comprise the bits of the bit-vectors, representing the sizes of the nets, for every net). 
Polosat is carried out by getting a model $\mu$ and then trying to improve it by repeatedly
flipping observables, which have not been assigned $\bot$ in previous models:

1: function SOLVER.POLOSAT(assumptions)
Require: Target bit-vector T is available; Observables Obs are available.
2:     μ := solver.Sat(assumptions)                   ▷ get the first model μ
3:     is_good_epoch := 1                             ▷ good epoch: an iteration, which improves μ
4:     while is_good_epoch do                         ▷ one loop is an epoch
5:         B := {v : v ∈ Obs, μ(v) = ⊤}               ▷ remove any observables, assigned ⊥
6:         is_good_epoch := 0
7:         while B is not empty do
8:             b_i := B.front(); B.dequeue()
9:             σ := solver.Sat(assumptions ∪ {¬b_i})  ▷ trying to flip b_i
10:            if satisfiable then
11:                if σ(T) < μ(T) then μ := σ and is_good_epoch := 1
12:                B := {b : b ∈ B, σ(b) = ⊤}         ▷ remove any observables, assigned ⊥
13:    return μ
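
The epoch structure of Polosat can be sketched in Python on the same kind of hypothetical brute-force `ToySolver`. Observables are modelled here as Boolean functions of a model (the bits of the target in the toy run); polarity fixing strategies are omitted, and all names are assumptions made for illustration.

```python
from itertools import product


class ToySolver:
    """Brute-force stand-in for an incremental BV solver (hypothetical API)."""

    def __init__(self, domains):
        self.domains = domains
        self.asserted = []

    def assert_(self, constraint):
        self.asserted.append(constraint)

    def sat(self, assumptions=()):
        for model in product(*self.domains):
            if all(c(model) for c in (*self.asserted, *assumptions)):
                return model
        return None


def polosat(solver, target, observables, assumptions=()):
    best = solver.sat(assumptions)
    if best is None:
        return None
    good_epoch = True
    while good_epoch:                               # one pass is an epoch
        pending = [o for o in observables if o(best)]
        good_epoch = False
        while pending:
            obs = pending.pop(0)
            flipped = lambda m, obs=obs: not obs(m)  # try to flip obs to false
            candidate = solver.sat([*assumptions, flipped])
            if candidate is not None:
                if target(candidate) < target(best):
                    best, good_epoch = candidate, True
                # Keep only observables still assigned true.
                pending = [o for o in pending if o(candidate)]
    return best


def target(m):
    return m[0] + 2 * m[1]


# Observables: the four bits of the target, LSB first.
bits = [lambda m, i=i: (target(m) >> i) & 1 == 1 for i in range(4)]
solver = ToySolver([range(4), range(4)])
solver.assert_(lambda m: m[0] + m[1] >= 3)
result = polosat(solver, target, bits)
```

On this toy instance the sketch improves the first model (cost 6) to a model of cost 5 and then stops, although a model of cost 3 exists. This matches the local-search character of the procedure: a single Polosat call is not guaranteed to return a T-minimal model.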


To integrate Polosat into Linear Search, it is sufficient to replace solver.Sat invocations
by solver.Polosat invocations in the code.² We have shown in [16] that replacing
plain SAT invocations by Polosat invocations in Linear Search makes our placement
tool substantially more scalable. We reaffirm this result in Sect. 5.

Yet, despite the significant progress we had witnessed when applying Polosat, 
we found that combining Polosat into Linear Search is still insufficient for solving 
a variety of complex real-world instances of our industrial placement problem. This 
empirical challenge led us to develop our LSSO methodology, presented in this paper. 
As we shall see, combining LSSO and Polosat makes our tool considerably more 
scalable, while the methodology itself is generic and can be applied to solving a wide 
range of optimization problems. 


2.4 Local Search Algorithms 


Local search strategies [1] are a collection of algorithmic templates. An algorithmic
template specifies the main flow of an algorithm, but leaves some details unimplemented.
By implementing these details for a specific problem, one obtains an algorithmic
solution for that problem.


² Polosat also uses polarity fixing strategies, such as TORC [14,17], omitted here; please refer
to [16] for details. Additional non-anytime OBV algorithms are introduced in [19,22].


94 A. Cohen, A. Nadel and V. Ryvchin


2.4.1 Basic Local Search Strategy. The basic strategy generates an initial feasible 
solution and sets it as the current solution. Then, it enters a loop. In each iteration, it 
looks within a neighbourhood of the current solution for a feasible solution with a lower 
value of the objective function. If one is found, it is set to be the current solution. Other- 
wise, the algorithm is terminated returning the current solution. Note that this version is 
guaranteed to stop; it does so, when it reaches a local minimum of the objective function 
with respect to the neighbourhood used. 

To turn this algorithmic template into a complete algorithm, one has to implement 
the following problem-dependent items: (i) A procedure for generating an initial fea- 
sible element. (ii) A neighbourhood function assigning to each solution a subset of 
solutions. (iii) An algorithm for searching the neighbourhood for a better solution. 
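
The basic strategy can be sketched as a short Python template whose three problem-dependent items are passed in as arguments. The sketch picks the best improving neighbour in each iteration, which is one possible choice (the strategy itself only requires some improving neighbour); the function names and the toy cost profile are assumptions.

```python
def local_search(initial, neighbours, cost):
    """Basic local search: move to an improving neighbour until none exists."""
    current = initial
    while True:
        better = [s for s in neighbours(current) if cost(s) < cost(current)]
        if not better:
            return current          # local minimum w.r.t. the neighbourhood
        current = min(better, key=cost)


# Toy problem: solutions are indices into a cost profile, neighbours are the
# adjacent indices.
profile = [9, 7, 8, 3, 1, 2]


def adjacent(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(profile)]


from_left = local_search(0, adjacent, lambda i: profile[i])
from_middle = local_search(3, adjacent, lambda i: profile[i])
```

Started from index 0, the search stops at index 1 (cost 7), a local minimum of poor quality, whereas started from index 3 it reaches the global minimum at index 4. This is the behaviour that the advanced variants of Sect. 2.4.3 are designed to mitigate.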


2.4.2 Neighbourhood Functions. A key factor, which affects both the complexity 
of the search and the quality of the resulting solution, is the selection of a neighbour- 
hood function. In theory, the selection ought to depend on a mathematical analysis of 
the structure of the feasible set and the profile of the objective function. For complex 
problems, such an analysis is usually beyond reach. The classical approach to neigh- 
bourhood definition is based on the following problem-independent general principles: 


1. Drawing on an analogy to optimization algorithms in the continuous case (such as 
gradient descent or line search), a neighbourhood should be so defined as to make 
its elements “close” to the current solution. So, typically, the neighbourhood of a 
feasible solution is specified by a small class of feasibility-preserving modifica- 
tions/perturbations to its combinatorial structure. 

2. A neighbourhood should be so defined as to ensure that it is polynomially search- 
able. Hence, unless we have a sophisticated non-exhaustive neighbourhood search 
algorithm, neighbourhoods should be small. 


However, as we have argued in Sect. 1, this approach is not without issues. In par- 
ticular, feasibility-preserving perturbations may not be easy to find, especially for NP- 
Hard-Feasible problems, while having small neighbourhoods implies a low likelihood 
of high-quality solutions. 


2.4.3 Advanced Versions of Local Search. A disadvantage of the basic version of lo- 
cal search is that it may stop at a local minimum of a poor quality, if too small a region of 
the feasible space is explored. To circumvent this outcome, advanced variants enabling 
an exploration of larger portions of the feasible space have been devised [7,29]. Those 
described here provide some mechanism to escape from the local minimum to “nearby” 
solutions and resume the search from there. They have been designed to accommodate 
situations, where local minima are not distributed uniformly in the feasibility space, but 
are rather clustered in close proximity [25]. 

The variable neighbourhood search approach uses multiple neighbourhoods to escape
from local minima. It relies on the fact that a local minimum with respect to one
neighbourhood need not be a local minimum with respect to another (if the latter is
not contained in the former). The algorithm maintains a set of neighbourhood functions.
Once a local minimum with respect to the current neighbourhood is reached, the
neighbourhood is switched, and the search is resumed.

The hill climbing method allows the selection of a non-improving solution, once 
a local minimum is reached. Since the objective function no longer monotonically de- 
creases, there is now a possibility of a cycle: a solution may be visited more than once 
forcing the search into an infinite loop. One can deal with this problem in various ways: 
ignore it and let the algorithm run until the timeout expires, use randomization, or in- 
troduce data structures that keep track of the search history and prohibit solutions that 
have already been encountered. The latter approach is referred to as tabu search. 

Another idea is to use large neighbourhoods. This approach increases the size of the 
explored region and the likelihood of better solutions. However, large neighbourhood 
search may become intractable. 

The iterated local search approach can be viewed as “a local search within a local 
search”. In each iteration of the search, it uses a subsidiary search algorithm to explore 
iteratively a feasible sub-space. Once a local minimum is returned, a new search is 
initiated in a region, whose elements are obtained by “perturbing” the recent solution. 

All the above approaches can be implemented within our LSSO framework. The 
key difference between LSSO and previous approaches is using SAT or Polosat as 
an oracle for both finding the initial solution and carrying out the neighbourhood search. 


3 Local Search with SAT Oracle (LSSO) 


This section introduces the main contribution of our paper. We propose using SAT as an 
oracle in local search algorithms to address the scalability and quality issues that arise 
in the classical local search algorithms, especially, given an NP-Hard-Feasible problem. 

Given a combinatorial optimization problem, the first stage in designing an LSSO 
solution is expressing the problem as a COP. 

In the second stage, the COP decision variables are translated to bit-vectors, and the 
feasibility constraints are translated to a BV formula (including any additional industrial 
requirements). One might experiment with several alternative formulations and select 
the one deemed best. 

The third stage is defining the so-called neighbourhood generators. A neighbourhood
generator $N(\mu)$ accepts as an input a solution $\mu$ (that is, a model to the bit-vector
formula, representing the COP), and generates neighbourhood constraints. The set of all
the assignments which satisfy the feasibility and neighbourhood constraints constitutes
the neighbourhood of the solution. Thus, finding such an assignment amounts to finding
an element of the neighbourhood of $\mu$.

A key ingredient of our methodology is the adoption of a neighbourhood concept, 
which differs significantly from the classical one, described in Sect. 2.4.2: 


1. The neighbourhood need not be small and need not contain (only) elements “close”
to the current solution.

2. Normally, $N(\mu)$ should generate constraints which ensure a cost lower than that of
$\mu$. If such a formulation is possible, then an iteration of the local search algorithm
merely needs to find a model to these constraints in order to progress.

3. If the objective function is too complex to model in its entirety, a neighbourhood
generator might attempt to ensure a better value for the objective function by imposing
constraints on the objective function’s sub-components. For example, when
the objective function is a very large sum of bit-vector terms, one might impose
constraints on the sum’s terms or small partial sums thereof.

4. Notwithstanding the above, neighbourhood generators may support hill climbing, in 
which case, the constraints are so formulated as to admit non-improving solutions. 


Note that, in our approach, neighbourhoods direct the search to “higher-quality” 
regions with respect to the current solution, regardless of the algorithmic difficulties of 
searching such regions. This is another key aspect of our approach: we trust SAT solvers 
to search complex sub-spaces efficiently. 

Having discussed neighbourhoods, we are now ready to describe the simplest LSSO 
implementation: 


1. A BV solver instance is created and the COP is provided to the solver. Specifically, 
we represent the COP’s decision variables as bit-vectors, where the widths are cho- 
sen to accommodate the largest values. We provide the feasibility constraints to the 
solver as BV constraints. Then, we implement neighbourhood generators, which, 
given a feasible solution, return a set of BV constraints defining its neighbourhood. 

2. The local search is carried out as follows: 

(a) The algorithm obtains an initial solution by asserting the feasibility constraints
and asking the solver for a model. This model is set as the current solution $\mu$.

(b) The algorithm enters a loop, in which the solver operates in incremental mode.
In each iteration, the algorithm calls the neighbourhood generator with the current
solution as input, to generate a list of BV constraints. These are provided
to the solver, which is asked for a model. If a model $\sigma$ is found, $\mu$ is set to $\sigma$.
Otherwise, the algorithm terminates, returning $\mu$.


The neighbourhood constraints can be given to the solver as either assumptions or 
assertions. This leads to two types of search, providing a tradeoff between execution 
time and quality: 


1. Non-speculative search: the neighbourhood constraints are passed to the solver as 
assertions. Once assertions are passed to the solver, they are enforced in all ensuing 
iterations. The search proceeds through a monotone sequence of decreasing neigh- 
bourhoods until a local minimum is reached. Thus, the search is localized and is 
relatively fast at the possible expense of quality. 

2. Speculative search: the neighbourhood constraints are passed to the solver as as- 
sumptions. The neighbourhood constraints are valid only for one iteration. Thus, 
the current neighbourhood is not intersected with previous neighbourhoods and a 
larger portion of the feasibility space will be explored. The search is expected to 
be slower, since the SAT solver handles assumptions less efficiently than assertions [18],
but the quality of the resulting solution is expected to be better, since the
search can explore a greater part of the feasibility space, especially when combined
with variable neighbourhood search and hill climbing.
Alg. 1 depicts our implementation of LSSO. The algorithm receives four inputs. The
Boolean inputs VNS, HC, and SPEC specify whether variable neighbourhood search,
hill climbing, and speculative search are to be used. All combinations are possible,
except that hill climbing requires speculative search. The input $N_{max}$ applies to variable
neighbourhood search. It specifies an upper bound on the number of consecutive
neighbourhood switches without finding a solution. If that bound is exceeded, the algorithm
terminates with the current solution. To effect variable neighbourhood search, the
algorithm uses a predefined list of neighbourhood generators $N = [N_0(\mu), N_1(\mu), \ldots]$. The
first generator $N_0(\mu)$ is considered the default and is used most of the time. The others
are used to escape local minima.

Alg. 1 carries out iterated local search with Polosat as an oracle, where the
observables are recommended to be set to the bits of the inputs of the objective function.
One can also replace the Polosat invocation by an ordinary SAT invocation.


4 LSSO Algorithms for the Cell Placement Problem 


This section presents our LSSO-based placement algorithms. All the algorithms are 
instantiations of Alg. 1 with different sets of parameters. The BV constraints are gener- 
ated by translating the COP constraints, as discussed in Sect. 2.3. Each algorithm uses 
some of the neighbourhood generators defined in Sect. 4.1. 

The algorithms are presented in Sect. 4.2. None of the algorithms define the target
bit-vector explicitly, since they rely on local search instead of OBV solving. By default,
the algorithms use Polosat as the oracle, where the observables comprise all the bits
of the bit-vectors representing the sizes of the nets. The size of net $n$ is given by
the following bit-vector term (for every intermediate term and the resulting term $\|n\|$,
its width is set to the minimal possible width which prevents an overflow, where the
operands are zero-extended, whenever required):


$$\|n\| = \left(\max_{c \in n}(c^{east}) - \min_{c \in n}(c^{west})\right) + \left(\max_{c \in n}(c^{north}) - \min_{c \in n}(c^{south})\right)$$


4.1 Neighbourhood Generators 


4.1.1 Neighbourhood Generator $N_1$. Let $\mu$ be a placement, that is, a model to the
bit-vector formula representing the feasibility constraints. The neighbourhood $N_1(\mu)$
is designed for a highly localized fast search at the possible expense of quality. To this
end, the constraints corresponding to $N_1(\mu)$ force a decrease of the objective function
in a very constrained manner, so as to help the solver to come back quickly. $N_1(\mu)$
consists of all the legal placements, for which all the nets are no bigger and at least one
net is smaller than under $\mu$, thus ensuring a lower cost. The constraints are:


$$\underbrace{\bigwedge_{n \in I} \|n\| \le \mu(\|n\|)}_{\text{each net is no bigger}} \;\wedge\; \underbrace{\bigvee_{n \in I} \|n\| < \mu(\|n\|)}_{\text{at least one net is smaller}}$$
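
As a sanity check of the $N_1$ definition, membership in $N_1(\mu)$ can be expressed as a predicate over the lists of net sizes of a candidate placement and of the current placement $\mu$. This is a plain Python illustration (the function name and the sample sizes are assumptions); in LSSO the same condition is handed to the solver as BV constraints rather than evaluated after the fact.

```python
def in_n1(new_sizes, old_sizes):
    """True iff every net is no bigger and at least one net is smaller."""
    pairs = list(zip(new_sizes, old_sizes))
    no_bigger = all(new <= old for new, old in pairs)
    one_smaller = any(new < old for new, old in pairs)
    return no_bigger and one_smaller


current = [20, 18, 20]                      # net sizes under the current placement
shrinks_one = in_n1([20, 18, 18], current)
unchanged = in_n1([20, 18, 20], current)
trades_off = in_n1([16, 20, 20], current)   # lower total, but one net grows
```

The last candidate has a lower total size (56 versus 58) but is excluded from $N_1(\mu)$ because its second net grows, which shows how restrictive this neighbourhood is.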




Algorithm 1 Local Search with SAT Oracle (LSSO)


1: procedure LOCALSEARCH(VNS = ⊤, HC = ⊤, SPEC = ⊤, N_max = 10)
Require: L                                  ▷ feasibility constraints
Require: N := [N_0(μ), N_1(μ), ...]         ▷ neighbourhood constraints generators
Require: J(μ)                               ▷ hill climbing constraints generator
       ▷ From now on, confine the search to the feasible space
2:     solver.Assert(L)
3:     current ← solver.Sat()               ▷ find the initial solution
4:     if ¬current then return None         ▷ the problem is unsatisfiable
       ▷ Loop initialization
5:     best ← current
6:     stop ← ⊥                             ▷ stopping condition
7:     jump ← ⊥                             ▷ indicates whether hill climbing should be attempted
8:     i ← 0                                ▷ current neighbourhood index
9:     while ¬stop do
           ▷ Compute neighbourhood constraints
10:        if HC ∧ jump then                ▷ hill climbing is required
11:            neighbourhood_constraints := J(current)
12:        else                             ▷ hill climbing is not required
13:            neighbourhood_constraints := N[i](current)
           ▷ If the mode is speculative, constraints are assumptions; otherwise they are assertions
14:        if SPEC then
15:            assertions := []; assumptions := neighbourhood_constraints
16:        else
17:            assertions := neighbourhood_constraints; assumptions := []
           ▷ Search for the next solution
18:        solver.Assert(assertions)
19:        next ← solver.Polosat(assumptions)
20:        if next then                     ▷ found a solution
21:            current ← next; i ← 0; jump ← ⊥
22:            if current.cost < best.cost then best ← current
23:            continue
           ▷ Solution not found
           ▷ If we are in variable neighbourhood mode and the number of consecutive
           ▷ neighbourhood switches without a model has not exceeded the bound, move
           ▷ to next neighbourhood
24:        if VNS ∧ (i < (N_max - 1)) then
25:            i ← i + 1
26:            continue
           ▷ If we are in hill climbing mode, and have exhausted the bound on neighbourhood
           ▷ switches without getting a model, and hill climbing has not already been
           ▷ attempted in this iteration, attempt it in the next iteration
27:        if HC ∧ ¬jump then
28:            jump ← ⊤
29:            continue
           ▷ If we got here, we are stuck and need to terminate
30:        stop ← ⊤
31:    return best
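
The control flow of Alg. 1 can be exercised with the following Python sketch. It makes several simplifying assumptions that are not part of the paper: the oracle is a hypothetical brute-force `ToySolver` (plain satisfiability instead of Polosat), generators return lists of Python predicates instead of BV constraints, the neighbourhood index is additionally capped by the length of the generator list, and hill climbing is disabled in the toy run since it may cycle.

```python
from itertools import product


class ToySolver:
    """Brute-force stand-in for the SAT/Polosat oracle (hypothetical API)."""

    def __init__(self, domains):
        self.domains = domains
        self.asserted = []

    def assert_(self, constraint):
        self.asserted.append(constraint)

    def sat(self, assumptions=()):
        for model in product(*self.domains):
            if all(c(model) for c in (*self.asserted, *assumptions)):
                return model
        return None


def lsso(solver, feasibility, cost, generators, hill_climb=None,
         vns=True, spec=True, n_max=10):
    solver.assert_(feasibility)
    current = solver.sat()
    if current is None:
        return None                          # the problem is unsatisfiable
    best, i, jump = current, 0, False
    last = min(n_max, len(generators)) - 1   # cap by the list length (assumption)
    while True:
        if hill_climb is not None and jump:
            constraints = hill_climb(current)
        else:
            constraints = generators[i](current)
        if spec:                             # speculative: assumptions
            found = solver.sat(constraints)
        else:                                # non-speculative: assertions
            for constraint in constraints:
                solver.assert_(constraint)
            found = solver.sat()
        if found is not None:
            current, i, jump = found, 0, False
            if cost(current) < cost(best):
                best = current
            continue
        if vns and i < last:                 # switch neighbourhood
            i += 1
            continue
        if hill_climb is not None and not jump:
            jump = True                      # attempt hill climbing next
            continue
        return best                          # stuck: terminate


def cost(m):
    return m[0] + 2 * m[1]


def componentwise(cur):                      # restrictive default neighbourhood
    return [lambda m: m[0] <= cur[0] and m[1] <= cur[1],
            lambda m: m[0] < cur[0] or m[1] < cur[1]]


def any_improvement(cur):                    # larger escape neighbourhood
    return [lambda m: cost(m) < cost(cur)]


def feasible(m):
    return m[0] + m[1] >= 3


speculative = lsso(ToySolver([range(4), range(4)]), feasible, cost,
                   [componentwise, any_improvement])
non_speculative = lsso(ToySolver([range(4), range(4)]), feasible, cost,
                       [any_improvement], spec=False)
```

In the speculative run, the restrictive generator fails at every visited solution and the switch to the second generator is what lets the search progress, which illustrates the role of variable neighbourhood search.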
4.1.2 $N_2$: a Family of Neighbourhood Generators. The $N_2$ family is designed for
variable neighbourhood search. Each of its neighbourhoods strictly contains $N_1$ and
allows the objective function to decrease in more ways. This implies higher quality
solutions at the expense of slower convergence. To define the $N_2$ family, let
$\alpha = \|I\|$ be the number of nets and assume $\alpha \ge 3$. For each permutation
$\sigma$ of $[1 \ldots \alpha]$ and positive number $2 \le d < \alpha$ we define a
neighbourhood function $N_2[\sigma, d](\mu)$ as follows: Let
$n_{\sigma(1)}, \ldots, n_{\sigma(\alpha)}$ be the permuted sequence of the nets.
Partition this sequence into $\lceil \alpha/d \rceil$ segments of size $d$ (the last
segment could be shorter). The neighbourhood $N_2[\sigma, d](\mu)$ consists of all legal
placements for which the sum of the net sizes of each segment is no bigger than under
$\mu$, and the sum of at least one segment is smaller. Note that this ensures a cost
lower than that of the placement under $\mu$. By choosing different pairs $(\sigma, d)$,
one may obtain different neighbourhoods. The constraints are:

each sum is no bigger

$$\bigwedge_{k=1}^{\lceil \alpha/d \rceil} \left( \sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \|n_{\sigma(i)}\| \;\le\; \sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \mu(\|n_{\sigma(i)}\|) \right)$$

at least one sum is smaller

$$\bigvee_{k=1}^{\lceil \alpha/d \rceil} \left( \sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \|n_{\sigma(i)}\| \;<\; \sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \mu(\|n_{\sigma(i)}\|) \right)$$
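
The two segment constraints can be illustrated on concrete numbers. The following is a small sketch, assuming net sizes are given as plain integer lists and the permutation as a 0-based index list; the function name `in_n2` and the sample sizes are hypothetical, and legality of the placement itself is not checked.

```python
def in_n2(candidate, current, sigma, d):
    """Check the two segment constraints on net sizes (legality is not checked)."""
    # Partition the permuted net indices into segments of size d
    # (the last segment may be shorter).
    segments = [sigma[k:k + d] for k in range(0, len(sigma), d)]
    new_sums = [sum(candidate[i] for i in seg) for seg in segments]
    old_sums = [sum(current[i] for i in seg) for seg in segments]
    no_bigger = all(n <= o for n, o in zip(new_sums, old_sums))
    one_smaller = any(n < o for n, o in zip(new_sums, old_sums))
    return no_bigger and one_smaller


current = [4, 2, 3]          # hypothetical net sizes under the current placement
sigma = [0, 1, 2]            # identity permutation, 0-based
accepted = in_n2([5, 1, 2], current, sigma, d=2)   # segment sums 6 <= 6 and 2 < 3
rejected = in_n2([5, 2, 3], current, sigma, d=2)   # first segment sum 7 > 6
```

In the accepted candidate the first net grows from 4 to 5, yet the candidate is still admitted because only segment sums are constrained; this is the sense in which a segment-based neighbourhood lets the objective decrease in more ways.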


4.1.3 Hill-climbing Neighbourhood Generator $N_3$. $N_3$ is designed to implement
hill climbing. We reason as follows: If the current placement is not a global minimum,
there exists a placement with at least one smaller net. Hence, to tunnel away from the
local minimum, we generate the following neighbourhood constraints:

at least one net is smaller

$$\bigvee_{i=1}^{\alpha} \|n_i\| < \mu(\|n_i\|)$$
4.2 LSSO-based Algorithms for Placement


All the algorithms below are instantiations of Alg. 1; they use lists of neighbourhood 
generators, composed of the ones defined in Sect. 4.1, where hill climbing is carried out 
by using the neighbourhood generator N3. Due to project deadline constraints, we did 
not explore other combinations. 


1. single_nbr_nonspec
   (a) parameters: VNS = $\bot$, HC = $\bot$, SPEC = $\bot$, $N_{max} = 1$.
   (b) list of neighbourhood generators: $[N_1]$
2. many_nbr_nonspec
   (a) parameters: VNS = $\top$, HC = $\bot$, SPEC = $\bot$, $N_{max} = 10$.
   (b) list of neighbourhood generators: $N_2[\sigma, d](\mu)$, enumerated by drawing
       $\sigma$ and $d$ by a pseudo-random generator.
3. many_env_spec
   (a) parameters: VNS = $\top$, HC = $\bot$, SPEC = $\top$, $N_{max} = 10$.
   (b) list of neighbourhood generators: the first generator is $N_1$, and the rest are
       $N_2[\sigma, d](\mu)$, enumerated by drawing $\sigma$ and $d$ by a pseudo-random
       generator.
4. many_env_spec_hill_clmb
   (a) parameters: VNS = $\bot$, HC = $\top$, SPEC = $\top$, $N_{max} = 1$.
   (b) list of neighbourhood generators: $[N_1]$
   (c) neighbourhood generator $N_3$ is used for hill climbing.


5 Experimental Results 


We study the performance of the following algorithms within our placement tool: 


1. Algorithms which use Polosat as the satisfiability oracle:
   (a) ls (Linear Search, described in Sect. 2.3.2, with Polosat as the oracle)
   (b) single_nbr_nonspec (see Sect. 4.2)
   (c) many_nbr_nonspec (see Sect. 4.2)
   (d) many_env_spec (see Sect. 4.2)
   (e) many_env_spec_hill_clmb (see Sect. 4.2)
2. Algorithms which use standard SAT solving as the satisfiability oracle:
   (a) bs_no_polosat [19]: OBV-BS (see Sect. 2.3.2).
   (b) ls_no_polosat: Linear Search with SAT as the oracle
   (c) many_env_spec_hill_clmb_no_polosat: many_env_spec_hill_clmb with SAT instead
       of Polosat (to study the impact of disabling Polosat on LSSO, we chose
       many_env_spec_hill_clmb, since, as we shall soon see, it outperforms
       the other LSSO algorithms in a pairwise comparison).
3. virtual_best: represents the best result of the above algorithms per timeout.


We used an extensive set of 1200 proprietary industrial designs of various sizes and
complexities. The sizes of the grids (where a grid size is the width N multiplied by
the height M) can be characterized as follows: a) Minimum size = 70; b) Maximum =
364000; c) Average ≈ 4643; d) Standard deviation ≈ 18829. We used machines with
32 GB of memory running Intel® Xeon® processors with 3 GHz CPU frequency.

We ran the algorithms for 600 seconds and measured the quality of the placement
at different time intervals. Fig. 2 shows our main results. For each algorithm and time
interval, Fig. 2 displays a score which represents the quality. The score is a real
number between 0 and 1 inclusive, where the closer the score is to 1 the better. For each
algorithm and time interval, the score is the average of the following
score-per-instance: (the result of virtual_best in 600 sec.) / (the result of the
current algorithm within the current time interval). Our conclusions:
First, when using SAT as the oracle, Linear Search (ls_no_polosat) outperforms
OBV-BS (bs_no_polosat), demonstrating that OBV-BS is not useful when the
optimization target is a complex arithmetic expression (rather than a vector of
lexicographically ordered bits, where each bit is a result of a separate calculation
as in [19]). Based on this result, we preferred Linear Search over OBV-BS as the
baseline algorithm.

Second, confirming the conclusion of [16], Polosat makes Linear Search
substantially more efficient (compare ls to ls_no_polosat).

Third, and more importantly in the context of this work, our best novel LSSO
algorithm even without Polosat (many_env_spec_hill_clmb_no_polosat) is almost as
efficient as Linear Search with Polosat (ls), the latter being the state-of-the-art
in solving placement [16]. Moreover, the best Polosat-based LSSO algorithm
(many_env_spec_hill_clmb) is significantly more efficient than both aforementioned
algorithms. This result justifies the usage of both major components of our
solution: LSSO, the high-level local search on top of a satisfiability oracle
presented in this paper, and Polosat [16], the low-level local search simulation
with SAT.

Finally, the virtual best algorithm yields the best result overall, providing
evidence that the development of different LSSO algorithms pays off.
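
The score plotted in Fig. 2 can be sketched as a short computation. This is only an illustration of the averaging described in the text: the function name and the sample costs are hypothetical, and the sketch assumes every instance has a positive result for the algorithm at the given time, since how instances without a result are handled is not stated here.

```python
def score(virtual_best_600, results_at_t):
    """Average of (virtual-best result at 600 s) / (algorithm result at time t)."""
    ratios = [vb / r for vb, r in zip(virtual_best_600, results_at_t)]
    return sum(ratios) / len(ratios)


# Hypothetical placement costs for three instances (lower is better).
virtual_best = [10, 20, 30]
algorithm_at_t = [20, 20, 40]
s = score(virtual_best, algorithm_at_t)   # (0.5 + 1.0 + 0.75) / 3
```

Because the virtual best is by construction no worse than any single algorithm, each ratio lies in (0, 1], and a score of 1 means the algorithm matched the virtual best on every instance.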

Additionally, Table 1 shows a pairwise comparison between our four Polosat-based
LSSO algorithms. many_env_spec_hill_clmb outperforms the others.

Table 2 offers a fine-grained comparison between our best novel LSSO algorithm
many_env_spec_hill_clmb and the Polosat-based Linear Search ls, the latter
being the state-of-the-art in solving placement [16]. The comparison is provided per
grid size category and for two different timeouts. LSSO improves the performance
significantly for every input size category for both timeouts. Comparing the results
for the two timeouts on the biggest instances shows that, for large grids, increasing
the timeout widens the gap between LSSO and ls.

Finally, Table 3 shows the unique contribution of each algorithm to the virtual best
in 600 sec. (we dismissed all the instances on which there was more than one
best-performing solver). Notably, each of the LSSO algorithms is a contributor.
Surprisingly, many_nbr_nonspec contributes more than many_env_spec_hill_clmb,
despite the latter algorithm outperforming the former in a pairwise comparison. A
possible explanation is that we ran many_nbr_nonspec with Polosat only, while
many_env_spec_hill_clmb was run twice, once with Polosat and once with SAT. Another
surprising result is the significant contribution of
many_env_spec_hill_clmb_no_polosat, second only to many_nbr_nonspec,
implying that a SAT-based LSSO algorithm should be part of any parallel portfolio.


                        | many_nbr_nonspec | single_nbr_nonspec | many_env_spec
many_env_spec_hill_clmb | (730 141 329)    | (813 253 134)      | (227 893 80)
many_nbr_nonspec        |                  | (815 147 238)      | (344 170 686)
single_nbr_nonspec      |                  |                    | (130 280 790)

Table 1: Pairwise comparison between LSSO algorithms for the timeout of 600 sec. Each
non-empty cell (r, c) contains a comparison between Algorithm R in row r and Algorithm C
in column c. The value (w d l) in each non-empty cell is interpreted as follows: R
outscored C on w instances; there was a draw on d instances; C outscored R on l instances.
[Figure 2: plot of the score against time (in seconds) for bs_no_polosat, ls,
ls_no_polosat, many_env_spec, many_env_spec_hill_clmb,
many_env_spec_hill_clmb_no_polosat, many_nbr_nonspec, single_nbr_nonspec, and
virtual_best.]
Fig. 2: Comparing Algorithms Over Time
Grid size        | Timeout of 600 seconds               | Timeout of 300 seconds
                 | ls is better | Draw | LSSO is better | ls is better | Draw | LSSO is better
< 500            |           27 |   62 |            337 |           21 |   56 |            349
≥ 500 & < 10000  |           57 |   74 |            551 |           57 |   91 |            534
≥ 10000          |           17 |   28 |             47 |           18 |   40 |             34

Table 2: Comparing the best Polosat-based LSSO algorithm
(many_env_spec_hill_clmb) to the Polosat-based Linear Search (ls), the latter
comprising the previous state-of-the-art.


6 Conclusion 


We have presented a new methodology for solving NP-hard combinatorial optimization 
problems, called Local Search with SAT Oracle (LSSO). Our approach can handle prob- 
lems for which finding even one feasible solution is already NP-hard. LSSO applies lo- 
cal search which uses a SAT solver or the SAT-based optimization algorithm Polosat 
as an oracle. We have introduced a generic algorithm which integrates different local 
search schemes within the LSSO framework. Furthermore, we have implemented our 
approach in an industrial tool for solving the cell placement problem in VLSI and have 
shown that our new LSSO approach makes the tool substantially more efficient. Our 
tool has been successfully productized at Intel. 


Algorithm                          | Contribution || Algorithm     | Contribution
many_nbr_nonspec                   |          240 || ls            |           33
many_env_spec_hill_clmb_no_polosat |          181 || many_env_spec |           21
many_env_spec_hill_clmb            |           79 || ls_no_polosat |           12
single_nbr_nonspec                 |           54 || bs_no_polosat |            8

Table 3: Unique contribution to the virtual best per algorithm (sorted by the contribution).
Abstract. Infrastructure as Code is a new approach to computing infrastructure
management that allows users to leverage tools such as version control, automatic
deployments, and program analysis for infrastructure configurations. This approach
allows for faster and more homogeneous configuration of a complete infrastructure.
Infrastructure as Code languages, such as CloudFormation or Terraform, use a
declarative model so that users only need to describe the desired state of the
infrastructure. However, in practice, these languages are not processed atomically.
During an upgrade, the infrastructure goes through a series of intermediate states.
We identify a security vulnerability that occurs during an upgrade even when the
initial and final states of the infrastructure are secure, and we show that such
vulnerabilities are possible in Amazon’s AWS and Google Cloud. We call them
intra-update sniping vulnerabilities. To mitigate this shortcoming, we present a
technique that detects such vulnerabilities and pinpoints the root causes of
insecure deployment migrations. We implement this technique in a tool, Häyhä, that
uses dataflow graph analysis. We evaluate our tool on a set of open-source
CloudFormation templates and find that it is scalable and could be used as part of
a deployment workflow.


1 Introduction 


Managing an infrastructure of thousands of hosts, with different software and 
servers is nearly impossible to do manually. A relatively new approach to in- 
frastructure management is called Infrastructure as Code (IaC). This has given 
rise to many different tools with a shared goal: helping system administrators 
manage their infrastructure in the same way as they manage code. Some tools, 
like Ansible [20], Puppet [23] or Chef [6] are Configuration Management tools: 
they allow the administrator to specify the entire configuration of one or more 
running machines and automatically deploy it by connecting to that machine 
and performing administrative tasks on behalf of the administrator. These tools 
automatically detect and apply the steps necessary to switch from the current 
state of a machine to the desired state, specified by the administrator. Similarly, 
tools like Amazon’s CloudFormation [3] or Hashicorp’s Terraform [11] read a 
description of the desired infrastructure and automatically take the necessary 
[Figure 1 panels: (a) The initial deployment; (b) An insecure update order;
(c) A secure update order; (d) The target updated state]

Fig. 1: A deployment of a computation (the orange lambda), accessing a database
(the blue disk stack), which is accessible to the outside world through an API
(the purple gateway). The upgrade should change the computation to access
more sensitive data (the lambda with the subscript 2), but be authenticated
through a user check (the red identification checks).


steps to deploy that infrastructure. In CloudFormation, an infrastructure con- 
figuration is declared as a set of resources. 

Benefits of IaC are well-known among practitioners: the entire infrastructure
is described accurately by a configuration file, making it easy to debug or
visualize the infrastructure. This way the infrastructure can be version controlled
and documented like any other code. The tools help guarantee identical
configuration of hosts, making IaC an essential practice for security and
maintainability.

However, for all the benefits IaC brings, it also opens new security
vulnerabilities. We have identified a new class of vulnerabilities that appear while
the tool is operating on the infrastructure. In order to decrease infrastructure
upgrade times, deployment tools typically run many operations in parallel.
We argue that this parallelism, as well as the global naming used in these
infrastructures, can lead to discrepancies during the upgrade that violate
the intended security policy, even if the initial infrastructure and the target
infrastructure are both perfectly secure. We empirically validate our claims by
reenacting this vulnerability in both Amazon’s AWS and Google Cloud.


1.1 Proof of Concept 


When upgrading the infrastructure, if operators do not provide enough
dependencies, i.e., they do not impose an ordering on upgrade operations, a security
policy and a protected service might be upgraded in an order that exposes
private data. Consider the example given in Figure 1: an API service that replies
to any request with some benign information, as depicted in Fig. 1a. The
service is upgraded so that the API returns private information about users, and
the security policy is modified to allow only authenticated users to access the
service, as shown in Figure 1d. This architecture is a core architectural
building block for serverless computing. This same configuration is recommended in
AWS’s “Well Architected” developer guideline series [1]. The upgrade code is
functionally correct and implements the desired change, but the user did not
specify ordering constraints. Without such constraints, there are two
possible upgrade plans. First, as shown in Figure 1b, the backend computation
may be updated first. In this case, since the authentication has not yet been
added to the API, there is a short period of time where private data is publicly
accessible. The amount of time this information is exposed depends on the cloud
service provider and the particulars of the infrastructure, but is typically on
the order of seconds to minutes. We call this kind of weakness an intra-update
sniping vulnerability. The second possible upgrade order, shown in Figure 1c,
is the desired secure update order. Enforcing the second ordering requires
the user to explicitly specify an ordering constraint stating that the
authentication must be added before the backend computation is updated.
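
The two upgrade plans can be enumerated mechanically. The sketch below is a deliberately tiny, hypothetical model of Fig. 1 with only two updatable resources and sequential steps; the names `auth`, `backend`, and `insecure_orders` are assumptions made for illustration and do not correspond to any cloud provider API.

```python
from itertools import permutations

# Hypothetical two-resource model of Fig. 1: the API either requires
# authentication or not, and the backend serves public or private data.
INITIAL = {"auth": False, "backend": "public"}
UPDATES = {
    "auth": True,             # add the user check to the API
    "backend": "private",     # switch the computation to sensitive data
}


def exposed(state):
    # Private data is reachable without authentication.
    return state["backend"] == "private" and not state["auth"]


def insecure_orders(depends_on):
    """Return update orders that respect `depends_on` yet pass through an exposed state."""
    bad = []
    for order in permutations(UPDATES):
        # Skip orders that update a resource before one of its dependencies.
        if any(order.index(dep) > order.index(res)
               for res, deps in depends_on.items() for dep in deps):
            continue
        state = dict(INITIAL)
        for res in order:
            state[res] = UPDATES[res]
            if exposed(state):
                bad.append(order)
                break
    return bad


without_constraints = insecure_orders({})
with_constraint = insecure_orders({"backend": ["auth"]})
```

Without constraints, the order that updates the backend first passes through an exposed intermediate state even though both endpoints are secure; declaring that the backend depends on the authentication removes that order.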

Another instance of intra-update sniping vulnerability happens when
components are added or removed from an infrastructure, but no ordering constraints
are given between them and components that use them. As an example, suppose
a user is adding a lambda that reads data from a new S3 bucket. If no
dependency is specified, the lambda could be created and connected to the bucket
before CloudFormation recognizes that the bucket is already owned. The
attacker who owns this bucket may then inject their data into the user’s system
during the time it takes CloudFormation to notice the naming conflict and roll
back the migration. This is related to the issue of S3 bucket namesquatting [15].

Although this paper is mostly focused on Amazon’s infrastructure, we have 
successfully reproduced a similar scenario in Google Cloud, demonstrating that 
intra-update sniping vulnerabilities are not limited to one cloud provider. We 
reported this issue to Google, and although they acknowledged the problem, they 
explicitly stated that it is the responsibility of the user to ensure the security of 
their deployment. 


1.2 Detecting Intra-update Sniping Vulnerabilities 


We propose a tool, Häyhä, that detects possible intra-update sniping
vulnerabilities and proposes solutions to users. Häyhä allows CloudFormation users to
check the security of planned updates to their infrastructure, before they
actually deploy the update. Although our tool is specifically engineered to work
with CloudFormation, this class of vulnerabilities is not limited to it, and the
proposed solution is generic enough to be adopted in any other Infrastructure
as Code language.
The main challenge in detecting intra-update sniping vulnerabilities is in
determining the underlying issue with common deployment models that leads to the
security vulnerability. We identify parallelism and in-place upgrades as the root
causes, arguing there is a trade-off in Infrastructure as Code between security
and scalability. On the opposite side of this trade-off, some practitioners
advocate for Immutable Infrastructure [12] management, which rebuilds the entire
infrastructure from scratch on each update and only switches atomically to the
new infrastructure when it is ready. This practice would guarantee atomicity of
updates to the infrastructure and the absence of intra-update sniping
vulnerabilities. However, this comes with a huge cost in terms of scalability and does
not apply well when statefulness is required (for example, migrating an existing
database), making it a less attractive practice.

Naturally, there is a connection between intra-update sniping vulnerability
and the problem of data races and concurrent access. Our proposed solution, of
adding ordering constraints, is somewhat similar to generic tools in the
concurrency domain, such as memory barriers or locks [19,16,24], that add constraints
to the order of execution of a program. However, the focus of our work is
configuration files that describe infrastructure, not programs. We cannot simply apply
existing work, because these configuration files do not have a formal semantics,
which creates an additional challenge for our problem domain.

In summary, we identify the following key contributions of this paper: 


– The description of intra-update sniping vulnerabilities and how they arise in
IaC services, with examples in AWS and Google Cloud.

– An intermediate representation of IaC configurations that allows us to reason
about security and network properties of a deployment, as well as about
changes in deployments.

– A tool, Häyhä [17], that statically checks for potential intra-update sniping
vulnerabilities in a proposed infrastructure update.

– An evaluation on CloudFormation files scraped from GitHub, showing that Häyhä
scales and runs fast enough to be adopted into developer workflows.


2 A Model for Infrastructure as Code 


Our tool, Häyhä, detects the possibility of a sniping attack in future deployments.
It analyzes the given deployment and raises alarms when it detects potential
security issues. The tool follows steps that we further detail in this section.
Step 1: Internal representation. First, Häyhä reads the configuration of
the current and target infrastructure and translates them to the internal
representation. This representation is a dataflow graph identifying which component
of the infrastructure has access to which other components, and under which
security assumptions. Figure 2 shows two such simplified dataflow graphs that
our tool built from the architecture in Fig. 1. From this graph, Häyhä learns the
desired security level of each component. In this section we describe how to compute
security levels of resources in a given CloudFormation file: in Section 2.1 we
describe the concrete syntax of a general CloudFormation file and how it applies
[Figure 2 panels: (a) An Initial Dataflow Graph, with nodes Web, PublicGet, and
PublicLambda; (b) A Target Dataflow Graph, with nodes Web, Authorizer, PrivateGet,
and PrivateLambda]

Fig. 2: Dataflow graphs derived from an infrastructure


to other IaC tools; in Section 2.2 we describe how we model an infrastructure 
in terms of network communication and security; finally, in Section 2.3 we show 
the execution semantics and computation of the security level of resources in an 
infrastructure. 

Step 2: Capturing all potential upgrade states. After the initial and
target configurations are converted to our model, Häyhä builds an upgrade state,
designed to represent every possible intermediate infrastructure that could exist
during the upgrade. In Section 2.4 we formally define the upgrade semantics
from an initial state to a target state in terms of our model, while in Section 3.1
we show how the upgrade state is built in practice. Figure 3 shows such a state,
in the form of a graph, which contains a path (Web to PublicGet to PrivateLambda)
allowing any user on the web to access a sensitive resource in a non-secure
manner. Finally, in Section 3.2 we discuss how dependency relations refine the
upgrade state.


[Figure 3: graph with nodes Web, PublicGet, PrivateGet, PublicLambda, and
PrivateLambda]

Fig. 3: Upgrade State with a Path Exposing a Security Vulnerability


Step 3: Analysis. (Section 3.3) Häyhä computes an over-approximation of 
the intermediate states and the security level of their nodes in order to answer 
two questions: 1) is every node in every possible intermediate state at least as 
secure as the corresponding node in the initial or target configuration? and 2) 
does every node in every possible intermediate state communicate only with 
existing nodes? Any possible violation is reported to the user so they can take 
action and modify their target configuration accordingly. For example, using the 
DependsOn keyword, one can enforce build orders in a CloudFormation file. For
Figure 3, Häyhä reports the possible insecure access to PrivateLambda:

Resource PrivateLambda is not sufficiently protected, it needs at
least Authorizer and is protected by None during upgrade. Add DependsOn
properties to ensure correct security.
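
A fix along the lines of this report could look as follows. The snippet is a hedged sketch that builds a template fragment as a Python dictionary and prints it as JSON: the resource names come from Fig. 3, `DependsOn` is the CloudFormation resource attribute mentioned above, and the helper `declared_dependencies` is hypothetical. The properties these resource types require are omitted, so the fragment is illustrative rather than deployable.

```python
import json

# Fragment only: the properties CloudFormation requires for these resource
# types are omitted, so this is not a deployable template.
template = {
    "Resources": {
        "Authorizer": {"Type": "AWS::ApiGateway::Authorizer"},
        "PrivateLambda": {
            "Type": "AWS::Lambda::Function",
            # Update PrivateLambda only after Authorizer has been updated.
            "DependsOn": ["Authorizer"],
        },
    }
}


def declared_dependencies(template, name):
    """Return the DependsOn entries of a resource as a list."""
    deps = template["Resources"][name].get("DependsOn", [])
    # DependsOn may be written as a single string or as a list of strings.
    return [deps] if isinstance(deps, str) else list(deps)


print(json.dumps(template, indent=2))
```

Normalizing `DependsOn` to a list makes it straightforward to turn the declared dependencies into an ordering relation of the kind the analysis reasons about.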


2.1 CloudFormation Infrastructures 


CloudFormation uses a declarative language in which users can specify the de- 
sired state of their system. An example of a CloudFormation file is given on 
the left side of Figure 4. It shows a simplified example of an infrastructure in 
which an API can be called to access the result of running a Lambda (a simple
function). There are no formal semantics for CloudFormation files [4,9]: they
are simply YAML or JSON files created from the given AWS CloudFormation
templates. Other tools, such as Terraform by HashiCorp, follow a similar
template-based design.

To formalize the behavior of IaC languages, we would also need to formalize 
the precise behavior of components. However, these components are very diverse, 
ranging from firewalls and HTTP servers to general purpose machines or even 
entire network configurations. Fortunately, the intra-update sniping vulnerability 
is independent from the precise behavior of individual components, and we only 
need to analyze the network and security behavior of the infrastructure. We 
only track the security level of requests, and abstract away from their content. 
To describe our model, we need to introduce three concepts used in IaC: 

A component of the infrastructure is called a resource. Every configuration
file declares a set of resources and their configurations (e.g. Figure 4). Some
resources, like the LambdaExecutionRole and the LambdaPermission, are security
resources, and they prevent unauthorized use of other resources. Other
resources, like the GreetingLambda and the GreetingRequestGET, are actual running
processes, the latter also being publicly accessible. Finally, some resources
do not correspond to a running process, but to a group of resources, such as
GreetingApi, which gives some configuration value to every resource in the group.

A resource’s configuration may reference other resources, and we record that 
information in our model. Based on the CloudFormation documentation, we 
distinguish different types of references that we list below: 


– network references(r, r′) are directed network connections between two
components r and r′, that allow r to send requests to r′, and receive answers.

– incoming protection references(r, s) protect all incoming requests to a
resource r, using a security resource s.

– outgoing protection references(r, s) protect all outgoing requests from
a resource r, using a security resource s.

– connection protection references(r, r′, s) protect a specific connection
between two resources r and r′ using a security resource s.

– collection references(c, r) specify that a resource r is in a specific collection
resource c.
CloudFormation File                       | Corresponding Model
------------------------------------------+---------------------------------------
{ "Resources": {                          |
  "LambdaPermission": {                   | LambdaPermission [security]
    "Type": "AWS::Lambda::Permission",    |   intrinsic security: LambdaPermission,
    "Properties": {                       |
      "FunctionName": "GreetingLambda",   |   connection security(GreetingApi,
      "SourceArn": "GreetingApi"          |     GreetingLambda, this)
    }                                     |
  },                                      |
  "GreetingLambda": {                     | GreetingLambda
    "Type": "AWS::Lambda::Function",      |   intrinsic security: ⊤
    "Properties": {                       |
      "Role": "LambdaExecutionRole"       |
    }                                     |
  },                                      |
  "GreetingRequestGET": {                 | GreetingRequestGET [public]
    "Type": "AWS::ApiGateway::Method",    |   intrinsic security: ⊤,
    "Properties": {                       |
      "Integration": "GreetingLambda",    |   network(this, GreetingLambda),
      "RestApiId": "GreetingApi"          |   collects(GreetingApi, this)
    }                                     |
  },                                      |
  "GreetingApi": {                        | GreetingApi [collection]
    "Type": "AWS::ApiGateway::Api"        |   intrinsic security: ⊤
  },                                      |
  "LambdaExecutionRole": {                |
    "Type": "AWS::IAM::Role",             |
    "Properties": { }                     |
  }                                       |
} }                                       |

Fig. 4: Mapping Between a CloudFormation File and our Model


Each of these reference types can be present in any resource, any number
of times. The resource in which a reference is declared can take any role in the
relation that it defines, and we represent that resource as this in the model, as
shown on the right side of Figure 4.

In CloudFormation, a dependency is declared by using e.g. the DependsOn 
keyword. A dependency restricts the order in which updates can occur: before a 
resource can be updated, all the resources it depends on must have been updated. 


2.2 Model of a CloudFormation Infrastructure 


We now describe a model for a CloudFormation infrastructure. We define a 
state S = (R, D) as a set of resources and a partial order that represents the 
dependency relation between resources. A resource is a tuple composed of a name 
(string), a type, an intrinsic security context, an origin flag, the different types 
of references discussed above, and the original configuration of the resource. 

With $(id, id') \in D$ we denote that id depends on id′, and that id cannot be
upgraded until id′ is upgraded.

The origin flag denotes whether the resource comes from the initial state or 
the target state during an upgrade, but it is not used at all when dealing with 
a single state. Similarly, the original configuration’s type is not further defined,
and depends on the vendor. It is not used for a single deployment, and we only
use it to check for equality of resources when updating an existing deployment.

Inspired by Abstract Interpretation [10], we define a security context as an
abstract domain with a partial order and some abstract operations: a top, a
bottom, a meet, and a join. When two security contexts are comparable
($x \sqsubseteq y$), we say that x is less permissive than y, or that x is more
secure than y.

We define predicates that can help us to express some properties of resources 
in a specific state S: collection(r), resp. security(r), means that r is a resource 
whose type is that of a collection resource, resp. a security resource. We use 
public(r) to denote when r is a resource whose type is that of a resource that 
can be accessed from anywhere on the internet (although this might be restricted 
with security references), or if it is contained in a collection that is itself publicly 
accessible. 


Definition 1 (connection). A connection is possible between two resources
when there is a network reference between them or between resources that collect them.

$$\mathit{ref}(r, r') \iff \exists c, c'.\ \bigwedge \begin{cases} \text{network reference}(c, c') \\ r = c \,\vee\, \mathit{collects}(c, r) \\ r' = c' \,\vee\, \mathit{collects}(c', r') \end{cases}$$

The security of a connection is the minimum security level a request from 
r must have to be able to reach r’ directly. This definition reflects the fact 
that, when a connection is secured by multiple security resources, it must have 
sufficient authority to be accepted by all of them. 


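Definition 2 (connection security).

$$\mathit{security}(r, r') = \bigsqcap \left\{ \mathit{sec}(s) \;\middle|\; \exists c, c'.\ \bigwedge \begin{cases} \bigvee \begin{cases} \text{incoming protection}(c, s) \\ \text{outgoing protection}(c', s) \\ \text{connection protection}(c, c', s) \end{cases} \\ r = c \,\vee\, \mathit{collects}(c, r) \\ r' = c' \,\vee\, \mathit{collects}(c', r') \end{cases} \right\}$$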

2.3 Execution Semantics 


The execution semantics for our intermediate representation is given below. The
semantics explains which resources are allowed to talk to which resources, and
under which security level. When we write $L \vdash r \to r'$, it means that r is
allowed to send a request to r′, under the security level L.

A request can come from the internet (represented with the constant W)
and reach a public resource r′ if it has a sufficient security level L. Similarly, a
request can come from a resource r and reach r′ if it has a sufficient security
level, r′ is not a collection, and both resources have an adequate configuration
that allows them to communicate.

$$\frac{r' \in R \qquad \neg\mathit{collection}(r') \qquad L \sqsubseteq \mathit{security}(W, r') \qquad \mathit{public}(r')}{L \vdash W \to r'}\ \text{OutsideRequest}$$

$$\frac{(r, r') \in R^2 \qquad \neg\mathit{collection}(r') \qquad L \sqsubseteq \mathit{security}(r, r') \qquad \mathit{ref}(r, r')}{L \vdash r \to r'}\ \text{InternalRequest}$$

A path P is a finite sequence of resources whose first resource is public,
and whose subsequent resources can each be reached from the previous one, using the
above semantics under some security level. The security of a path is then defined as
the minimal security level under which every node can be reached in the above
semantics:

$$\mathit{security}((r_1, \ldots, r_k)) = \bigsqcap_{i=1}^{k} \mathit{security}(r_{i-1}, r_i)$$

with $r_0 = W$. We write $W \to^* r$ for the set of paths whose last element is r.
Similarly, the security of a node is defined as the minimal security level under
which the node can be reached by at least one path:

$$\mathit{Sec}(r) = \bigsqcup \{\mathit{security}(P) \mid P \in W \to^* r\}$$

When the infrastructure under which we consider the security of resources
is not clear from the context, we clarify it with a subscript: $\mathit{Sec}_S(r)$.
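
To make the two definitions concrete, the sketch below computes node security on a small graph. It rests on several assumptions that are not part of the model above: a security context is encoded as the set of security resources a request must satisfy (so the empty set is the top element), meet is set union, join is set intersection, connection securities are given directly as edge labels, and the edges only loosely follow Fig. 3. Only cycle-free paths are enumerated; with this encoding a path that repeats a node can never be more permissive than the same path with the cycle removed, so the join is unaffected.

```python
from functools import reduce

TOP = frozenset()  # most permissive level: no protection required


def meet(a, b):
    return a | b   # more secure: requirements accumulate


def join(a, b):
    return a & b   # more permissive: only shared requirements remain


def paths_to(target, edges, source="W"):
    """Enumerate cycle-free paths from `source` to `target` as lists of edges."""
    def walk(node, seen):
        if node == target:
            yield []
            return
        for (a, b) in edges:
            if a == node and b not in seen:
                for rest in walk(b, seen | {b}):
                    yield [(a, b)] + rest
    yield from walk(source, {source})


def node_security(target, edges):
    """Join over all paths of the meet of the connection securities on each path."""
    levels = [reduce(meet, (edges[e] for e in path), TOP)
              for path in paths_to(target, edges)]
    return reduce(join, levels) if levels else None  # None: unreachable


AUTH = frozenset({"Authorizer"})
# Hypothetical edge labels: only the W -> PrivateGet connection is protected.
target_state = {
    ("W", "PrivateGet"): AUTH,
    ("PrivateGet", "PrivateLambda"): TOP,
}
upgrade_state = {
    **target_state,
    ("W", "PublicGet"): TOP,
    ("PublicGet", "PrivateLambda"): TOP,   # stale connection during the upgrade
}

sec_target = node_security("PrivateLambda", target_state)
sec_upgrade = node_security("PrivateLambda", upgrade_state)
```

Here `sec_target` requires the authorizer, while `sec_upgrade` is the top element because the unprotected path through PublicGet dominates the join; this mirrors the kind of gap that the report for Figure 3 describes.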


Definition 3 (Substate). When comparing two states, $S_1$ and $S_2$, we say that
$S_1 \sqsubseteq S_2$ when

– Every resource of $S_1$ is a resource of $S_2$ and
– For every pair of resources r, r′ in $S_1$, if $L \vdash r \to r'$ holds in $S_1$,
then it also holds in $S_2$.


Our first lemma states that, when a state is a substate of another, its nodes 
are at least as secure as the other. 


Lemma 1 (Substate Security). 
$\forall S_1, S_2.\ \forall id \in S_1.\ S_1 \sqsubseteq S_2 \implies \mathit{Sec}_{S_1}(id) \sqsubseteq \mathit{Sec}_{S_2}(id)$


Proof. We note that by definition, id is in both states. Additionally, any path in
$S_1$ is also a path in $S_2$, and since the security of connections in $S_1$ is more secure
than the same connections in $S_2$, the security of paths in $S_1$ is greater than the
security of the same paths in $S_2$.

The security of a node is the meet of the security of paths that lead to it in
the state. Paths that lead to id in $S_2$ are the paths that lead to it in $S_1$, and
potentially additional paths. Therefore, the security of id in $S_1$ is greater than
in $S_2$.


2.4 Upgrade Semantics and Security Policy 


In IaC tools, an upgrade changes a given infrastructure state to a new state. This 
is done by upgrading each node that needs to be changed as specified by the 
new configuration. Generally, nodes are upgraded in an unspecified order, even
in parallel, to improve deployment speed. Node updates are sent asynchronously
to every service that needs to be updated, and there are dozens if not hundreds 
of steps each service must take to complete its update. When these upgrades 
are sent in parallel, it is difficult to reason about the state of the system as the 
running time for a node upgrade depends on the latency of the service. To model 
this behavior, we define an interleaving semantics for upgrades. 

An upgrade starts in an initial state $S_i$ and ends in a target state $S_t$.
Additional dependency ordering information is provided by the relation D of the
target state.

The configuration of an identifier can be updated if all its dependencies are
already updated ($\forall id'.\ (id, id') \in R \implies S(id') = S_t(id')$), and it has not been
updated yet:


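\[
\frac{S(id) \neq S_t(id) \qquad \forall id'.\ (id, id') \in R \implies S(id') = S_t(id')}
     {S \to S[id \mapsto S_t(id)]}
\;\textsc{UpgradeCont}
\]


A new resource can be created under the same conditions, if it was not present
in the initial state:


\[
\frac{id \notin S \qquad \forall id'.\ (id, id') \in R \implies S(id') = S_t(id')}
     {S \to S[id \mapsto S_t(id)]}
\;\textsc{UpgradeAdd}
\]


An identifier can be removed, if it is not in the target state:


\[
\frac{id \notin S_t \qquad id \in S}
     {S \to S \setminus id}
\;\textsc{UpgradeDel}
\]


We collect every accessible intermediate state in a set denoted by Acc:


\[
\frac{}{S_i \in \mathit{Acc}}
\qquad
\frac{S \in \mathit{Acc} \qquad S \to S'}
     {S' \in \mathit{Acc}}
\;\textsc{AccNext}
\]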

Note that, in the absence of any dependency, Acc contains every combination
where each resource is either at its initial or target configuration, leading to $2^n$
possible intermediate states when n is the number of changed resources.
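The interleaving semantics can be explored on small examples. The following is a minimal sketch, assuming a state is a dictionary from identifiers to configurations (with `None` never used as a configuration) and that a dependency pair `(id, dep)` means `id` may only change once `dep` has reached its target. The function names are hypothetical and not taken from the tool.

```python
def successors(state, target, deps):
    """Yield every state reachable from `state` in one upgrade step."""
    for rid in set(state) | set(target):
        # Extended equality: an identifier absent on both sides counts as equal.
        deps_ready = all(state.get(d) == target.get(d)
                         for (r, d) in deps if r == rid)
        if rid in target and state.get(rid) != target[rid] and deps_ready:
            # Covers both updating an existing resource and creating a new one.
            nxt = dict(state)
            nxt[rid] = target[rid]
            yield nxt
        elif rid in state and rid not in target:
            # Removal of a resource that is not in the target state.
            nxt = dict(state)
            del nxt[rid]
            yield nxt


def accessible(initial, target, deps=frozenset()):
    """Collect all states reachable from `initial`, as frozensets of items."""
    def freeze(s):
        return frozenset(s.items())

    seen = {freeze(initial)}
    todo = [dict(initial)]
    while todo:
        state = todo.pop()
        for nxt in successors(state, target, deps):
            if freeze(nxt) not in seen:
                seen.add(freeze(nxt))
                todo.append(nxt)
    return seen
```

With three changed resources and no dependency, `accessible` returns 8 states. Adding a single dependency between two of them removes the two combinations in which the dependent resource is updated before its dependency.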

We next show that, when two identifiers are in a dependency relation, some 
intermediate states are not possible. For ease of expressing this lemma, we extend 
equality to also check whether id is in the domain of S. If id is neither in S nor
S', we have S(id) = S'(id). Otherwise, id must be in both and associated to the
same configuration for the equality to hold.


Lemma 2 (Dependency Restriction). 
$\forall (id, id') \in R,\ S \in \mathit{Acc} \implies S(id) = S_i(id) \lor S(id') \neq S_i(id') \lor S_i(id) = S_t(id) \lor S_i(id') = S_t(id')$


Proof. By induction on $S \in \mathit{Acc}$ and by case analysis on the inequality that holds
in the inductive case.


We now define the security policy as: 

Definition 4 (Security Policy). A deployment from $S_i$ to $S_t$ is secure iff:


\[
\forall S \in \mathit{Acc},\ \forall id,\quad
\begin{cases}
\mathit{Sec}(S, id) \sqsubseteq \mathit{Sec}(S_i, id) & \text{if } S_i(id) = S(id) \\
\mathit{Sec}(S, id) \sqsubseteq \mathit{Sec}(S_t, id) & \text{if } S_t(id) = S(id) \\
\mathit{Sec}(S, id) = \bot & \text{otherwise (id is not in } S\text{)}
\end{cases}
\]


Our work focuses on security issues that happen during upgrades, assuming 
that the initial and target states are both secure. We require that in any
intermediate state every resource is at least as secure as its counterpart in the initial
or target state, depending on where its configuration comes from.


3 Architectural Design of the Häyhä Tool 


3.1 Upgrade States 


To verify the security of intermediate states, we could compute all the possible 
intermediate states and pass them to existing tools that could check the secu- 
rity of such states. However, this approach has two main drawbacks. First, we 
would need to construct $2^n$ intermediate states, which does not scale for large
infrastructure changes. Second, the result of such tools would not be easy to 
understand for end users, as they would report issues with states that are not 
defined or even considered by the user. Our goal is a tool that is both scalable 
and able to provide suggestions on how to change the target configuration, not 
some hidden intermediate configuration. 


[Figure 5 panels: (a) Graphical Representation of the Initial State;
(b) Graphical Representation of the Target State;
(c) Graphical Representation of the Upgrade State]


Fig. 5: Example Upgrade State 


To address scalability we introduce upgrade states which represent multiple 
states on which we can apply the same execution semantics. Recall that a state 
is composed of a list of resources with their origin, type and references, and 
of a dependency relation. An upgrade state is composed in the same way. The 
set of resources is the union of the resources from the initial and target states, 
excluding initial resources that only differ from their target counterpart by their 
provenance flag. When resources are added or removed from an infrastructure,
we introduce an empty resource for each of them. They represent the absence of 
these resources. The dependency relation of the upgrade state is the dependency 
relation of the target state. 

The execution semantics of an upgrade state is the same as the execution 
semantics of a normal state. Since the upgrade state represents multiple versions 
of the same resources at the same time, we need to change the definition of the 
security level of a connection between resources. An example of an upgrade state 
is given in Figure 5. The initial state has an API, a GET method and a lambda, 
and everything is public. The target state modifies the lambda and adds an 
authorizer. The upgrade state comprises the unchanged API, the target
authorizer (with an empty resource as its initial counterpart), the GET method
(which did not change), and the two variants of the lambda. The connection
to the GET method is protected either by the empty node ($\top$) or the target
authorizer. The minimal security level for this connection is therefore $\top$.

In summary, when a security resource is relevant for a connection, we need 
to consider its counterpart that has a different provenance flag. If it is also 
relevant, the connection is protected by the disjunction of the security level of 
these resources (they cannot both exist at the same time, but one of them exists 
at any given time). If it is not relevant, the upgrade state represents at least one 
case where the security resource is not relevant, meaning that the connection 
is protected by the disjunction of the first security level and $\top$, which is $\top$
(no security at all). If the counterpart is an empty resource, the upgrade state 
represents at least one case where the security resource was deleted (or not yet 
added), so the connection is also unprotected. If there is no counterpart, the 
connection is simply protected by the resource, because it does not change in 
any way during the upgrade. 
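The case distinction above can be written down as a small function. This is only a sketch under assumptions of ours: security levels are sets of alternative credential sets, `TOP` stands for $\top$ (no protection), and the function name and the string tags `"empty"` and `"irrelevant"` are hypothetical.

```python
TOP = frozenset({frozenset()})  # assumed encoding of "no security at all"


def disj(a, b):
    """Disjunction of two levels, dropping clauses subsumed by a weaker one."""
    merged = a | b
    return frozenset(c for c in merged if not any(o < c for o in merged))


def connection_protection(level, counterpart):
    """Protection of a connection guarded by a security resource with `level`.

    `counterpart` describes the variant with the other provenance flag:
    None (no counterpart), "empty" (added or deleted resource),
    "irrelevant" (counterpart does not guard this connection),
    or a level (counterpart also guards the connection).
    """
    if counterpart is None:
        return level                  # resource does not change during the upgrade
    if counterpart in ("empty", "irrelevant"):
        return disj(level, TOP)       # collapses to TOP: unprotected at some point
    return disj(level, counterpart)   # one of the two variants exists at any time
```

For the example of Figure 5, the target authorizer has an empty initial counterpart, so the call returns `TOP`, matching the conclusion drawn in the text.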

We denote by $U(S_i, S_t)$ the upgrade state created from the initial state $S_i$
and the target state $S_t$. We now show that this state indeed collects all possible
intermediate states.


Lemma 3 (Upgrade Graph is an Overapproximation). 
$\forall S \in \mathit{Acc}.\ S \sqsubseteq U(S_i, S_t)$


Proof. To apply the definition, we first show that resources of S are resources of U.
Then, we show that any connection in S is a connection in U, because resources 
come with the same references in both states. 


3.2 Splitting Dependencies 


We have seen that the upgrade state created from the initial and target configu- 
rations is an over-approximation of all the intermediate states, when we do not 
consider dependencies. Because dependencies reduce the number of intermedi- 
ate states, the upgrade state might not be precise enough and might produce a 
warning when no actual intermediate states violate the security policy.

Variants. When the state has two nodes A and A' with the same identifier,
but a different label, we call them variants of one another. When A belongs to
the initial configuration and A' to the target configuration, (A, A') is called an
upgrade pair.

We refine the upgrade state by splitting it along a dependency. Considering 
a state S, its dependency relation D, and two target resources $(A', B') \in D$,
the split of S, split(S, A', B'), is a set of upgrade states. Suppose A' and B' are,
respectively, part of an upgrade pair (A, A') and (B, B'). Then, split(S, A', B')
is the set of three upgrade states, where only one of A or A' remains, and only
one of B or B'. We exclude the case where A' and B remain. When any of these
nodes does not exist, the number of possible combinations is reduced. When only
A’ and B exist in S, we have found an impossible situation, and the result of 
splitting is the empty set. 
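A possible reading of the split operation in code, as a sketch only: an upgrade state is assumed to map each identifier to the set of variants it still contains (`"init"`, `"tgt"`, or both), and a dependency `(a, b)` means that `a` is upgraded after `b`. The representation and the name `split` are illustrative assumptions.

```python
def split(upgrade_state, dependency):
    """Split an upgrade state along one dependency (a depends on b).

    Returns the refined upgrade states in which exactly one variant of `a`
    and one variant of `b` remain, leaving out the impossible combination
    "a already at target while b is still initial".
    """
    a, b = dependency
    refined_states = []
    for variant_a in sorted(upgrade_state[a]):
        for variant_b in sorted(upgrade_state[b]):
            if variant_a == "tgt" and variant_b == "init":
                continue  # ruled out by the dependency
            refined = dict(upgrade_state)
            refined[a] = frozenset({variant_a})
            refined[b] = frozenset({variant_b})
            refined_states.append(refined)
    return refined_states
```

When both identifiers have two variants, the result has three states. When only the target variant of `a` and the initial variant of `b` are present, the result is empty, which corresponds to the impossible situation described above.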

Although this process creates an exponential number of states, the number of 
dependencies tends to be limited in practice, because they slow upgrades down. 
At the same time, a big number of dependencies actually reduces the number of 
possible intermediate states, until every node is in a dependency, in which case 
there are exactly n intermediate states. 

We now prove that splitting the upgrade state is correct, in the sense that 
the set of states split(S) still contains all the possible intermediate states (Acc): 


Theorem 1 (Correct Split). 
$\forall S \in \mathit{Acc}.\ \exists u \in \mathit{split}(U(S_i, S_t)).\ S \sqsubseteq u$


Proof. Let us take a state $S \in \mathit{Acc}$ from the set of all possible intermediate
states. Since splitting a state according to a dependency preserves the states
from Acc (Lemma 4 below), we can consider every dependency and split them
in any order. Initially, it holds that $S \sqsubseteq U(S_i, S_t)$, using Lemma 3.

Consider an upgrade state u such that $S \sqsubseteq u$ and D(id, id'). By Lemma 4,
we can find a state $u' \in \mathit{split}(u, id, id')$ such that $S \sqsubseteq u'$.

After applying this for each dependency, u' is one of the states resulting from
$\mathit{split}(U(S_i, S_t))$, and the claim of the theorem holds.


The following intermediate lemma is needed to prove the correctness of the
split. It states that if a state contains one of the accessible states, splitting a 
dependency in it results in a set of states, where one of them still contains this 
intermediate state. 


Lemma 4 (Split Graphs).
$\forall S \in \mathit{Acc}.\ \forall (id, id') \in D.\ S \sqsubseteq u \implies \exists u' \in \mathit{split}(u, id, id').\ S \sqsubseteq u'$


Proof. Take (A, A') the upgrade pair whose identifier is id. Similarly, take (B, B')
the upgrade pair whose identifier is id'. Since $S \in \mathit{Acc}$, A' and B cannot both
exist at the same time in S (Lemma 2). Since $S \sqsubseteq u$, we also know that u has
at least one variant of id and one variant of id', the ones that are present in S.

The states from split(u, id, id') are composed of the same nodes as u, except
for id and id', where they all have one of the four possible combinations of
initial and target states, except for the pair A', B. Since S doesn't have them
both either, one state has the same variants of id and id' as S, and we call it
u'. We now show that $S \sqsubseteq u'$.

First, we note that u' has the same nodes as u, except for those with identifier
id and id'. For any resource in S, the resource was present in u, so it is also in
u', unless it has identifier id or id'. For these last cases, we note that u' is defined
to contain the same variants as S, so the resources of S are also resources of u'.

Second, if we take $L \vdash r \to r'$ in S, we can use the same reasoning as in
Lemma 3 to conclude that it also holds in u'. Thus we conclude that $S \sqsubseteq u'$.


3.3 Finding Vulnerabilities 


After Häyhä constructs the upgrade state, the next step is to check for security 
issues. Although we could split the upgrade state recursively until no dependency 
remains, a more interesting strategy is to immediately check the upgrade state for 
issues. If none is found, it is not necessary to refine the upgrade state. Otherwise, 
we try to find a relevant dependency and split the upgrade state on it, running 
the analysis on the resulting states, splitting on other dependencies as needed. 
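The check-then-refine strategy can be sketched as a short recursive procedure. This is not the tool's implementation. It assumes that upgrade states map identifiers to sets of variants, that dependencies are processed in the given order rather than by relevance, and that `has_issue` is a caller-supplied, monotone check (if an upgrade state has no issue, none of its refinements has one). All names are hypothetical.

```python
def split(u, dep):
    """Refine `u` along dependency (a, b): a is upgraded after b."""
    a, b = dep
    return [{**u, a: frozenset({va}), b: frozenset({vb})}
            for va in sorted(u[a]) for vb in sorted(u[b])
            if not (va == "tgt" and vb == "init")]


def analyze(u, deps, has_issue):
    """Return the refined upgrade states that still exhibit an issue."""
    if not has_issue(u):
        return []            # nothing to report, no need to refine further
    if not deps:
        return [u]           # cannot refine any more: report this state
    findings = []
    for refined in split(u, deps[0]):
        findings.extend(analyze(refined, deps[1:], has_issue))
    return findings
```

As a toy usage, take a check that flags a state in which the new lambda may run while the authorizer is still missing. Without dependencies the issue is reported, and declaring that the lambda depends on the authorizer makes the warning disappear.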

Our analysis detects two types of issues: first, if an empty node is accessible,
it might be used by the infrastructure at a point when it is not registered by the
owner of the infrastructure. This is the case for a new node that is accessible
before it is created. When that node is a resource that can be claimed by a third
party (such as an S3 bucket), the attacker might be able to register it before the
user. Similarly, for a deleted resource, an attacker could register it for themselves
before the user stops using it.

Second, the security context of every node in the upgrade state is compared 
to the security of the same node in the initial or target state (depending on its 
provenance flag). When its security is strictly lower than the security of the node 
in the state it comes from, or incomparable, we raise an alarm because there is 
an intermediate step where the resource might not be sufficiently protected. 

Using Lemma 1 and Theorem 1, when the security of a node in a possible 
intermediate state (collected in Acc) is insufficient, the security of that node in 
at least one split upgrade state is even lower. Therefore, if there is a violation of 
the security property, our tool will detect it. 


4 Experiments 


Häyhä is designed to be used before the deployment of a CloudFormation update,
and it is crucial that Häyhä does not interrupt developer workflow. Our goal
was, therefore, to evaluate the scalability of Häyhä on a variety of real-world
CloudFormation updates. To do this, we collected 36 CloudFormation files from
GitHub, where each file had a history of updates (commits). We ran Häyhä
against every update recorded in GitHub to that file, and measured the running
time. We found that our analysis completed within one second for all files. We
believe that these results indicate that Häyhä could be integrated in developer
workflow with minimal disruption to the user. The details of the evaluation
dataset are given in Fig. 6.

Fig. 6: Analysis time of various CloudFormation files from GitHub. Point size is
proportional to the number of updated resources, which are between 0 and 31. 


To collect the set of GitHub CloudFormation files used in our scalability 
benchmark, we searched GitHub using the web search tool for code with the key- 
word AWSTemplateFormatVersion - which is a required keyword for any Cloud- 
Formation file. We then filtered by the .yaml extension, and further manually 
filtered for valid CloudFormation files (as opposed to other languages with over- 
lap). Since we wanted to track updates to these files, we also filtered manually 
to find only files that had a revision history (> 2 commits for the file). 

While we showed that Häyhä scales well on real world data, we did not iden- 
tify any instances of intra-update sniping vulnerability in these files. This is an 
expected result, as the CloudFormation files we found on GitHub were generally 
designed as templates that developers would customize to their own needs. We 
believe application-focused CloudFormation files are not often uploaded, since 
CloudFormation files can contain sensitive and proprietary information (e.g.
infrastructure design). In order to run a large-scale analysis to check for past
instances of intra-update sniping vulnerability, we would need access to a repos- 
itory of the private user data for many CloudFormation users. 


5 Related Work 


Following the development and use of Infrastructure as Code (IaC) practices, 
many threats and security challenges were recognized [26,27]. The security risks 
that have been identified in IaC have thus far remained similar to existing
vulnerabilities arising from poor security practices, such as infrequent key rotation
and hard-coded secret values [25]. Additionally, despite existing recommendations
and good practices when dealing with cloud infrastructure, many existing
deployments are still left insecure by user misconfigurations. For example,
storage “buckets”, which host files, should generally be configured by users to disallow
world readable/writable permissions. However, in practice, users struggle with
this [8]. Existing work has used SMT solvers to automatically detect such
vulnerabilities and help users secure their resources [4,9]. In contrast, we focus on
the dynamic behavior of deployment updates that occur when using IaC tools, 
and their effect on security configuration. 

Much work has focused on the security of virtualization technologies, with
attack models ranging from malicious cloud users to compromised cloud providers,
as summarized in [13]. In our work, however, we do not make any assumption
on the specific technology, as intra-update sniping vulnerabilities rely mostly on 
timing and insecure configuration on the user’s side. 

Our work is based on a graph model of the dataflow network of resources 
created in an infrastructure configuration. Similarly, Al-Shaer et al. [2] propose
to model and check network security using a graph-based model of the network.
As with other work on network and infrastructure security [5,18], the focus
of the analysis is on the security of static network topologies, instead of the 
security of a moving topology, as we have in this paper. The analysis of security 
in static networks and static information flow models [21] is complementary to 
our work, as we assume the initial and target infrastructure are secure. 

Beyond network configurations, there has been work in the analysis of con- 
figuration files. In particular, static analysis has been used to check that IaC 
configurations are idempotent [14,30], an important property for maintaining 
reproducibility of infrastructure. The reproducibility of infrastructure is known 
to be a challenge [7], despite IaC being declarative and version controlled. Fur- 
ther efforts have used probabilistic modelling to learn constraints on configura- 
tions [22,28,29]. 


6 Conclusion 


We have identified a new class of vulnerability that applies to Infrastructure
as Code services, intra-update sniping vulnerabilities, which arise from a lack of
ordering in upgrading resources. We presented a tool, Häyhä, that detects such
vulnerabilities in CloudFormation, and gives feedback to users on how to securely
update their infrastructure deployment. Our evaluation shows the scalability of
Häyhä: we ran it on existing configurations from GitHub and found that it
runs quickly enough to be usable in practice.


Abstract. The first-order theory of rewriting is a decidable theory for
linear variable-separated rewrite systems. The decision procedure is based 
on tree automata techniques and recently we completed a formalization 
in the Isabelle proof assistant. In this paper we present a certificate 
language that enables the output of software tools implementing the de- 
cision procedure to be formally verified. To show the feasibility of this 
approach, we present FORT-h, a reincarnation of the decision tool FORT 
with certifiable output, and the formally verified certifier FORTify. 


1 Introduction 


Many properties of rewrite systems can be expressed as logical formulas in the 
first-order theory of rewriting. This theory is decidable for the class of linear 
variable-separated rewrite systems, which includes all ground rewrite systems. 
The decision procedure is based on tree automata techniques and goes back to 
Dauchet and Tison [7]. It is implemented in FORT [17,18]. FORT takes as input
one or more rewrite systems $\mathcal{R}_0, \mathcal{R}_1, \ldots$ and a formula $\varphi$, determines whether
or not the rewrite systems satisfy the property expressed by $\varphi$, and
reports yes or no accordingly. FORT may not reach a conclusion due to limited resources.

For properties related to confluence and termination, designated competitions 
(CoCo [15], termCOMP [9]) of software tools take place regularly. Occasionally, 
yes/no conflicts appear. Since the participating tools typically couple a plethora 
of techniques with sophisticated search strategies, human inspection of the out- 
put of tools to determine the correct answer is often not feasible. Hence certified 
categories were created in which tools must output a formal certificate. This 
certificate is verified by CeTA [21], an automatically generated Haskell program
using the code generation feature of Isabelle. This requires not only that the 
underlying techniques are formalized in Isabelle, but the formalization must be 
executable for code generation to apply. During the time-consuming formaliza- 
tion process, mistakes in papers are sometimes brought to light. 

Since 2017 we have been concerned with the question of how to ensure the
correctness of the answers produced by FORT. The certifier CeTA supports a great many
techniques for establishing concrete properties like termination and confluence,
but the formalizations in the underlying Isabelle Formalization of Rewriting
(IsaFoR)³ are orthogonal to the ones required for supporting the decision
procedure underlying FORT. We recently completed the formalization of the automata
constructions involved in the decision procedure [14]. Earlier fragments were de- 
scribed in [8,13]. In this paper we put these efforts to the test. More precisely, 
we 


1. present a certificate language which is rich enough to express the various au- 
tomata operations in decision procedures for the first-order theory of rewrit- 
ing as well as numerous predicate symbols that may appear in formulas in 
this theory, 

2. describe the tasks required to turn the formalization described in [14] into 
verified code to check certificates within reasonable time, 

3. present a new reincarnation of FORT in Haskell, named FORT-h, which is 
capable of producing certificates. 


The remainder of the paper is organized as follows. The next section briefly 
recapitulates the first-order theory of rewriting and the variant of the decision 
procedure described in [14]. Sections 3 and 4 describe the representation of for- 
mulas in certificates and the certificate language. In Section 5 we describe how 
certificates are validated by FORTify, the verified Haskell program obtained from 
the Isabelle formalization. Section 6 describes FORT-h. Experimental results are 
presented in Section 7, before we conclude in Section 8. 


2 Preliminaries 


Familiarity with term rewriting [2] and tree automata [6] is useful, but we briefly 
recall important definitions and notation that we use in the remainder. 

Terms $\mathcal{T}(\mathcal{F}, \mathcal{V})$ are constructed from a signature $\mathcal{F}$, consisting of function
symbols with fixed arities, and a set of variables $\mathcal{V}$. A term rewrite system (TRS
for short) $\mathcal{R}$ consists of rewrite rules $\ell \to r$ between terms $\ell$ and $r$. Instead of the
usual restrictions $\ell \notin \mathcal{V}$ and $\mathit{Var}(r) \subseteq \mathit{Var}(\ell)$, we require $\mathit{Var}(\ell) \cap \mathit{Var}(r) = \emptyset$. Here
$\mathit{Var}(t)$ denotes the set of variables in a term $t$. Moreover, $\ell$ and $r$ are assumed to
be linear terms (i.e., variables occur at most once). The conditions on the rewrite
rules are necessary to ensure decidability of the first-order theory of rewriting for
these linear variable-separated TRSs. The (one-step) rewrite relation of a TRS
$\mathcal{R}$ is denoted by $\to_{\mathcal{R}}$. A term $t$ is ground if $\mathit{Var}(t) = \emptyset$. The set of ground terms
is denoted by $\mathcal{T}(\mathcal{F})$.
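The two conditions on rewrite rules are easy to check mechanically. The sketch below assumes a representation of our own choosing: variables are plain strings and function applications are pairs `(symbol, args)`, so a constant `a` is written `("a", ())`. The function names are hypothetical.

```python
def variables(term):
    """List all variable occurrences of a term, with repetitions."""
    if isinstance(term, str):          # a variable
        return [term]
    _symbol, args = term               # a function application
    return [v for arg in args for v in variables(arg)]


def is_linear(term):
    """A term is linear if no variable occurs more than once."""
    occurrences = variables(term)
    return len(occurrences) == len(set(occurrences))


def is_linear_variable_separated(rule):
    """Check that both sides are linear and share no variable."""
    lhs, rhs = rule
    shared = set(variables(lhs)) & set(variables(rhs))
    return is_linear(lhs) and is_linear(rhs) and not shared
```

Every ground rule passes the check trivially, which reflects the remark in the introduction that the class includes all ground rewrite systems.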

³ http://cl-informatik.uibk.ac.at/software/ceta/

The first-order theory of rewriting is defined over a language $\mathcal{L}$ containing
the predicate symbols $\to$, $\to^*$, $=$, and many more. As models, we consider finite
linear variable-separated TRSs $\mathcal{R}$ over signatures $\mathcal{F}$ such that $\mathcal{T}(\mathcal{F})$ is
non-empty. The set $\mathcal{T}(\mathcal{F})$ serves as domain for the variables in formulas over $\mathcal{L}$. The
interpretation of the predicate symbol $\to$ in $\mathcal{R}$ is the one-step rewrite relation
$\to_{\mathcal{R}}$ over $\mathcal{T}(\mathcal{F})$, $\to^*$ denotes the restriction of $\to_{\mathcal{R}}^*$ to terms in $\mathcal{T}(\mathcal{F})$, and $=$ is
interpreted as the identity relation on $\mathcal{T}(\mathcal{F})$. Since we use ground terms as
carrier, formulas in the first-order theory of rewriting express properties on ground
terms. For instance, the following formula $\varphi$ expresses the property of having
unique normal forms (UNR):


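\[
\forall s\, \forall t\, \forall u\, (s \to^* t \land \neg\exists v\, (t \to v) \land s \to^* u \land \neg\exists v\, (u \to v) \implies t = u)
\]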


To use $\varphi$ for establishing UNR for arbitrary terms (i.e., terms in $\mathcal{T}(\mathcal{F}, \mathcal{V})$) two
additional constant symbols need to be added to the signature [18]. (More on this
in Section 8.) Additional predicates in $\mathcal{L}$ increase the expressive power and also
allow expressing properties more compactly. For instance, we can write $\mathrm{NF}(t)$
for $\neg\exists v\, (t \to v)$ and $s \to^! t$ for $s \to^* t \land \neg\exists v\, (t \to v)$. In Section 3 we present a
grammar that describes the available constructions for predicates. All predicates 
that can be represented using these constructions are supported in our decision 
procedure. 

The decision procedure is based on tree automata that recognize relations 
on ground terms. Here we give a brief summary. More information can be found 
in [6] and [14]. A tree automaton $\mathcal{A} = (\mathcal{F}, Q, Q_f, \Delta)$ consists of a finite signature
$\mathcal{F}$, a finite set $Q$ of states, disjoint from $\mathcal{F}$, a subset $Q_f \subseteq Q$ of final states,
and a set of transition rules $\Delta$. Transition rules have one of the following two
shapes: $f(p_1, \ldots, p_n) \to q$ with $f \in \mathcal{F}$ and $p_1, \ldots, p_n, q \in Q$, or $p \to q$ with
$p, q \in Q$. The latter are called epsilon transitions. Transition rules can be viewed
as rewrite rules between ground terms in $\mathcal{T}(\mathcal{F} \cup Q)$. The induced rewrite relation
is denoted by $\to_\Delta$ or $\to_{\mathcal{A}}$. A ground term $t \in \mathcal{T}(\mathcal{F})$ is accepted by $\mathcal{A}$ if $t \to_{\mathcal{A}}^* q$
for some $q \in Q_f$. The set of all accepted terms is denoted by $L(\mathcal{A})$ and a set $L$
of ground terms is regular if $L = L(\mathcal{A})$ for some tree automaton $\mathcal{A}$.
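Acceptance by a tree automaton can be illustrated with a bottom-up evaluation. The sketch assumes ground terms are pairs `(symbol, args)`, ordinary transitions are stored in a dictionary from `(symbol, tuple_of_states)` to a set of states, and epsilon transitions are a set of state pairs. This data layout is ours, not the one used in FORT-h or FORTify.

```python
from itertools import product


def eps_closure(states, eps):
    """Close a set of states under epsilon transitions p -> q."""
    closed = set(states)
    todo = list(states)
    while todo:
        p = todo.pop()
        for (src, dst) in eps:
            if src == p and dst not in closed:
                closed.add(dst)
                todo.append(dst)
    return closed


def reachable_states(term, delta, eps):
    """All states q such that the ground term rewrites to q."""
    symbol, args = term
    arg_states = [reachable_states(arg, delta, eps) for arg in args]
    states = set()
    # For a constant, product() yields the single empty tuple.
    for combo in product(*arg_states):
        states |= delta.get((symbol, combo), set())
    return eps_closure(states, eps)


def accepts(term, delta, eps, final):
    """A term is accepted if it reaches at least one final state."""
    return bool(reachable_states(term, delta, eps) & final)
```

For instance, with transitions `a -> even`, `f(even) -> odd` and `f(odd) -> even` and final state `even`, the automaton accepts exactly the terms built from `a` with an even number of `f` symbols.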

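We encode n-tuples with $n \geq 1$ of ground terms as terms over an enriched
signature, as follows. We write $\mathcal{F}^{(n)}$ for the signature $(\mathcal{F} \cup \{\bot\})^n$ where $\bot \notin \mathcal{F}$
is a fresh constant. The arity of a symbol $f_1 \cdots f_n \in \mathcal{F}^{(n)}$ is the maximum of
the arities of $f_1, \ldots, f_n$. The encoding of terms $t_1, \ldots, t_n \in \mathcal{T}(\mathcal{F})$ is the unique
term $\langle t_1, \ldots, t_n \rangle \in \mathcal{T}(\mathcal{F}^{(n)})$ such that $\mathit{Pos}(\langle t_1, \ldots, t_n \rangle) = \mathit{Pos}(t_1) \cup \cdots \cup \mathit{Pos}(t_n)$
and $\langle t_1, \ldots, t_n \rangle(p) = f_1 \cdots f_n$ where $f_i = t_i(p)$ if $p \in \mathit{Pos}(t_i)$ and $f_i = \bot$
otherwise, for all $p \in \mathit{Pos}(\langle t_1, \ldots, t_n \rangle)$ and $1 \leq i \leq n$. As an example, for the
terms $s = f(g(a), f(b, b))$, $t = g(g(a))$, and $u = f(b, g(a))$ we obtain $\langle s, t, u \rangle =
fgf(ggb(aa\bot), f{\bot}g(b{\bot}a, b{\bot}\bot))$. An n-ary relation on ground terms is regular if
its encoding is accepted by a tree automaton operating on terms in $\mathcal{T}(\mathcal{F}^{(n)})$.
Such an automaton is called an $\mathrm{RR}_n$ automaton and regular n-ary relations are
called $\mathrm{RR}_n$ relations. The i-th cylindrification of an $\mathrm{RR}_n$ relation $R$ over $\mathcal{T}(\mathcal{F})$ is
the $\mathrm{RR}_{n+1}$ relation $\{(t_1, \ldots, t_{i-1}, u, t_i, \ldots, t_n) \mid (t_1, \ldots, t_n) \in R \text{ and } u \in \mathcal{T}(\mathcal{F})\}$.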
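The encoding of tuples can be computed by walking all terms in parallel. The sketch below uses a term representation assumed for illustration: a term is a pair `(symbol, args)`, and `None` marks a component that has no subterm at the current position. The names `encode` and `show` are hypothetical.

```python
BOT = "⊥"  # the fresh padding constant


def encode(terms):
    """Encode a tuple of ground terms as one term over the enriched signature.

    `terms` is a list whose entries are (symbol, args) pairs, or None where a
    component has no subterm at this position.
    """
    # The root symbol is the concatenation of the component symbols.
    symbol = "".join(t[0] if t is not None else BOT for t in terms)
    # Its arity is the maximum arity among the components that are present.
    arity = max((len(t[1]) for t in terms if t is not None), default=0)
    children = tuple(
        encode([t[1][i] if t is not None and i < len(t[1]) else None
                for t in terms])
        for i in range(arity)
    )
    return (symbol, children)


def show(term):
    """Render a term in the usual prefix notation."""
    symbol, args = term
    if not args:
        return symbol
    return f"{symbol}({','.join(show(arg) for arg in args)})"
```

Applied to the three terms of the example in the text, `show(encode([s, t, u]))` reproduces the encoding given there.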

Besides $\mathrm{RR}_n$ automata, the decision procedure makes use of ground tree
transducers (GTTs for short). A GTT is a pair $\mathcal{G} = (\mathcal{A}, \mathcal{B})$ of tree automata
over the same signature $\mathcal{F}$. A pair $(s, t)$ of ground terms in $\mathcal{T}(\mathcal{F})$ is accepted by
$\mathcal{G}$ if $s \to_{\mathcal{A}}^* u \mathrel{{}^*_{\mathcal{B}}{\leftarrow}} t$ for some term $u \in \mathcal{T}(\mathcal{F} \cup Q)$. Here $Q$ is the combined set of
states of $\mathcal{A}$ and $\mathcal{B}$. The set of all such pairs is denoted by $L(\mathcal{G})$. We denote by
$L_a(\mathcal{G})$ the set of all pairs $(s, t)$ such that $s \to_{\mathcal{A}}^* q \mathrel{{}^*_{\mathcal{B}}{\leftarrow}} t$ for some state $q \in Q$. A
binary relation $R$ on ground terms is a(n anchored) GTT relation if there exists
a GTT $\mathcal{G}$ such that $R = L(\mathcal{G})$ ($R = L_a(\mathcal{G})$). The decision procedure for the
first-order theory of rewriting described in [7] and implemented in FORT uses GTTs,
whereas the formalized variant described in [14] uses anchored GTTs (aGTTs), which
have better closure properties. Both are supported in our certificate language, 
but FORT-h and FORTify use anchored GTTs since they permit us to model 
more predicates while reducing the need for ad-hoc constructions that need to 
be turned into executable (verified) code. 

The decision procedure for the first-order theory of rewriting constructs $\mathrm{RR}_n$
automata for the subformulas in a bottom-up fashion. GTTs (aGTTs) come 
into play for some of the atomic subformulas consisting of predicate symbols and 
variables. Closure properties take care of the logical structure of formulas. A final 
emptiness check determines whether the formula is satisfied for the TRS given 
as input to the decision procedure. Rather than formally stating the properties 
involved, we illustrate the decision procedure on an example. 


Example 1. Consider the formula $\varphi = \forall s\, \exists t\, (s \to^* t \land \mathrm{NF}(t))$, which expresses
the normalization property of TRSs. To determine whether a TRS $\mathcal{R}$ over a
signature $\mathcal{F}$ satisfies $\varphi$, we first construct an $\mathrm{RR}_1$ automaton $\mathcal{A}_1$ that accepts
the ground normal forms in $\mathcal{T}(\mathcal{F})$, using an algorithm first described in [5] and
recently formalized in [13]. For the subformula $s \to^* t$ we construct a GTT $\mathcal{G}_1$ for
the parallel rewrite relation $\xrightarrow{\parallel}_{\mathcal{R}}$. Since GTT relations are effectively closed under
transitive closure (while $\mathrm{RR}_2$ relations are not), we obtain a GTT $\mathcal{G}_2$ for $\to_{\mathcal{R}}^*$.
This GTT is transformed into an $\mathrm{RR}_2$ automaton $\mathcal{A}_2$. (In the formalized decision
procedure described in [14], an $\mathrm{RR}_2$ automaton for $\to^*$ is constructed from an
anchored GTT for the root step relation $\to_{\mathcal{R}}^{\epsilon}$, using suitable closure properties of
anchored GTT and $\mathrm{RR}_2$ relations.) We cylindrify the $\mathrm{RR}_1$ automaton $\mathcal{A}_1$ into an
$\mathrm{RR}_2$ automaton $\mathcal{A}_3$ that accepts $\mathcal{T}(\mathcal{F}) \times \mathrm{NF}_{\mathcal{R}}$. A product construction involving
$\mathcal{A}_2$ and $\mathcal{A}_3$ produces an $\mathrm{RR}_2$ automaton $\mathcal{A}_4$ for the subformula $s \to^* t \land \mathrm{NF}(t)$.
Projection yields an $\mathrm{RR}_1$ automaton $\mathcal{A}_5$ corresponding to $\exists t\, (s \to^* t \land \mathrm{NF}(t))$. So
$\varphi$ holds if and only if $L(\mathcal{A}_5) = \mathcal{T}(\mathcal{F})$. In FORT the $\forall$ quantifier is transformed into
the equivalent $\neg\exists\neg$. Hence complementation is used to obtain an $\mathrm{RR}_1$ automaton
$\mathcal{A}_6$ and the existential quantifier is implemented using projection. This gives an
$\mathrm{RR}_0$ automaton $\mathcal{A}_7$ which either accepts the empty relation $\emptyset$ or the singleton
set $\{()\}$ consisting of the nullary tuple $()$. The outermost negation gives rise
to another complementation step. The final $\mathrm{RR}_0$ automaton $\mathcal{A}_8$ is tested for
emptiness: $L(\mathcal{A}_8) = \emptyset$ if and only if the TRS $\mathcal{R}$ does not satisfy $\varphi$.
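The sequence of operations in Example 1 can be mimicked on a finite carrier, where relations are plain sets instead of tree automata. This is only an analogy to make the order of the steps concrete: it assumes a finite universe and an explicitly given step relation, and it is not how FORT or FORTify represent relations. The function name is hypothetical, and the local names follow the automata numbering of the example.

```python
def holds_normalization(universe, step):
    """Decide forall s exists t (s ->* t and NF(t)) over a finite universe."""
    universe = set(universe)
    # A1: the normal forms, i.e. elements without an outgoing step.
    a1 = {t for t in universe if not any(src == t for (src, _) in step)}
    # A2: reflexive-transitive closure of the step relation.
    a2 = {(t, t) for t in universe} | set(step)
    while True:
        new = {(x, w) for (x, y) in a2 for (z, w) in a2 if y == z} - a2
        if not new:
            break
        a2 |= new
    a3 = {(s, t) for s in universe for t in a1}   # cylindrification of A1
    a4 = a2 & a3                                  # product (conjunction)
    a5 = {s for (s, _) in a4}                     # projection (exists t)
    a6 = universe - a5                            # complement (inner negation)
    a7 = bool(a6)                                 # projection to a nullary relation
    a8 = not a7                                   # complement (outer negation)
    return a8                                     # non-empty iff the formula holds
```

On a universe `{a, b, c}` with steps `a -> b -> c` every element reaches the normal form `c` and the function returns `True`. With steps `a -> b -> a` neither `a` nor `b` has a normal form, so the result is `False`.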


3 Formulas 


The first step in the certification process is to translate formulas in the first-order 
theory of rewriting into a format suitable for further processing. We adopt de 
Bruijn indices [4] to avoid alpha renaming. 

Example 2. Consider the formula


forall s, t, u ([0] s ->* t & [1] s ->* u => 
exists v ([1] t ->* v & [0] u ->* v)) 


in FORT syntax. It expresses the commutation of two TRSs, indicated by the 
indices 0 and 1. Using de Bruijn indices for the term variables s, t, u, v produces 


\[
\forall\forall\forall\, (2 \to_0^* 1 \land 2 \to_1^* 0 \implies \exists\, (2 \to_1^* 0 \land 1 \to_0^* 0))
\]


We refer to Example 4 for further explanation. 


The formal syntax of formulas in certificates is given below. Angle brackets
⟨ ⟩ are used for non-terminal symbols. Here ⟨rr2⟩ denotes the supported binary
regular relations, which are formally defined after Example 3. Likewise, ⟨rr1⟩
stands for regular sets (which are identified with unary regular relations).


⟨formula⟩ ::= (rr1 ⟨rr1⟩ ⟨term⟩) | (rr2 ⟨rr2⟩ ⟨term⟩ ⟨term⟩)
            | (and ⟨formula⟩*) | (or ⟨formula⟩*) | (not ⟨formula⟩)
            | (forall ⟨formula⟩) | (exists ⟨formula⟩) | (true) | (false)
            | (restrict ⟨formula⟩ (⟨trs⟩+))

⟨term⟩ ::= ⟨nat⟩      ⟨trs⟩ ::= ⟨nat⟩ | ⟨nat⟩-      ⟨nat⟩ ::= 0 | 1 | 2 | ...


De Bruijn indices are used for ⟨term⟩ variables and ⟨nat⟩- denotes a TRS
with index ⟨nat⟩ in which the left- and right-hand sides of the rules have been
swapped. The class of linear variable-separated TRSs is closed under this
operation. We use it to represent the conversion relation $\leftrightarrow^*$ of a TRS $\mathcal{R}$ as the
reachability relation $\to^*$ induced by the TRS $\mathcal{R} \cup \mathcal{R}^-$.
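
The idea of expressing conversion as reachability in the union of a TRS with its inverse can be seen on a toy system. The following Python sketch is an illustration under strong simplifying assumptions: rules relate constants only, so no matching or context closure is needed, and the function names are hypothetical.

```python
from collections import deque


def inverse(rules):
    """Swap left- and right-hand sides of every rule."""
    return [(r, l) for (l, r) in rules]


def reachable(rules, start):
    """Constants reachable from start when rules relate constants only."""
    seen, todo = {start}, deque([start])
    while todo:
        t = todo.popleft()
        for l, r in rules:
            if l == t and r not in seen:
                seen.add(r)
                todo.append(r)
    return seen


R = [("a", "b"), ("a", "c")]                 # a -> b and a -> c
forward = reachable(R, "b")                  # b is a normal form
conversion = reachable(R + inverse(R), "b")  # b <- a -> c
```

Here b cannot reach c by forward rewriting, yet b and c are convertible, which the reachability computation in the combined system detects.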


Example 3. The commutation property in Example 2 is rendered as follows: 


(forall (forall (forall (or (not (and (rr2 (step* (0)) 2 1)
(rr2 (step* (1)) 2 0))) (exists (and (rr2 (step* (1)) 2 0) 
(rr2 (step* (0)) 1 0))))))) 


Here (step* (0)) denotes the RR$_2$ relation $\to^*$ induced by the first TRS (which
is indexed by 0) and (rr2 (step* (1)) 2 0) represents the subformula [1] t
->* v of the FORT formula in Example 2.


We continue with the certificate syntax of RR$_1$ and RR$_2$ relations:


⟨rr1⟩ ::= (terms) | (nf (⟨trs⟩+)) | (inf ⟨rr2⟩) | (proj (1 | 2) ⟨rr2⟩)
        | (union ⟨rr1⟩ ⟨rr1⟩) | (inter ⟨rr1⟩ ⟨rr1⟩) | (diff ⟨rr1⟩ ⟨rr1⟩)

⟨rr2⟩ ::= (gtt ⟨gtt⟩ ⟨pos⟩ ⟨num⟩) | (product ⟨rr1⟩ ⟨rr1⟩) | (id ⟨rr1⟩)
        | (union ⟨rr2⟩ ⟨rr2⟩) | (inter ⟨rr2⟩ ⟨rr2⟩) | (diff ⟨rr2⟩ ⟨rr2⟩)
        | (comp ⟨rr2⟩ ⟨rr2⟩) | (inverse ⟨rr2⟩)

⟨pos⟩ ::= >= | ε | >     ⟨num⟩ ::= >= | 1 | >

⟨gtt⟩ ::= (root-step (⟨trs⟩+)) | (inverse ⟨gtt⟩) | (union ⟨gtt⟩ ⟨gtt⟩)
        | (acomp ⟨gtt⟩ ⟨gtt⟩) | (gcomp ⟨gtt⟩ ⟨gtt⟩) | (inter ⟨gtt⟩ ⟨gtt⟩)
        | (acomplement ⟨gtt⟩) | (atc ⟨gtt⟩) | (gtc ⟨gtt⟩)


Here (terms) refers to $\mathcal{T}(\mathcal{F})$, (nf (⟨trs⟩+)) to the normal forms (NF)
induced by the union of the underlying TRSs, and (inf ⟨rr2⟩) to the infinity
predicate (INF) which is satisfied by all terms having infinitely many
successors with respect to the relation R. Furthermore, (proj (1 | 2) ⟨rr2⟩) denotes
projection ($\pi$) to the first (second) argument, (gtt ⟨gtt⟩ ⟨pos⟩ ⟨num⟩) the
transformation of a GTT relation into an RR$_2$ relation with corresponding context
closure (cf. [14, Section 3]), (id ⟨rr1⟩) the identity relation on the underlying set,
and (gtc ⟨gtt⟩) ((atc ⟨gtt⟩)) the (anchored) transitive closure of the underlying
(anchored) GTT relation.

The constructs defined above closely correspond to the formalized closure 
operations for the predicates in the first-order theory of rewriting, reported in [14] 
and summarized below: 


AS +.|A7>|AUA|At|At|AoA|AGA|AC| ANA 

R := A| R| RUR|ROR| R |T xT |=r 

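$T ::= \mathcal{T}(\mathcal{F}) \mid \mathsf{NF} \mid \mathsf{INF}_R \mid T \cup T \mid T \cap T \mid T^c \mid \pi_1(R) \mid \pi_2(R)$

$n ::= {\geq} \mid 1 \mid {>} \qquad p ::= {\geq} \mid \varepsilon \mid {>}$

Here A are anchored GTT relations (⟨gtt⟩), R are RR$_2$ relations (⟨rr2⟩), and T
are regular sets of ground terms (⟨rr1⟩).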


For convenience of tool authors, we add a few other constructs to ⟨rr2⟩. The
certifier expands these to a sequence of basic constructs given above.


⟨rr2⟩ ::= ··· | (step (⟨trs⟩+)) | (step= (⟨trs⟩+))
        | (step+ (⟨trs⟩+)) | (step* (⟨trs⟩+)) | (equality)
        | (parallel-step (⟨trs⟩+)) | (root-step+ (⟨trs⟩+))
        | (non-root-step (⟨trs⟩+)) | (join (⟨trs⟩+))


The complete list can be obtained from the accompanying website. 


4 Certificates 


A certificate for a first-order formula $\varphi$ explains how the corresponding RR$_n$
automaton is constructed. We adopt a line-oriented natural deduction style. The 
automata are implicit. This is a deliberate design decision to keep certificates 
small. More importantly, it avoids having to check equivalence of finite tree 
automata, which is EXPTIME-complete [6, Section 1.7]. 


⟨certificate⟩ ::= (⟨item⟩ ⟨inference⟩ ⟨formula⟩ ⟨info⟩*) ⟨certificate⟩
                | (empty ⟨item⟩) | (nonempty ⟨item⟩)

⟨item⟩ ::= ⟨nat⟩     ⟨info⟩ ::= (size ⟨nat⟩ ⟨nat⟩ ⟨nat⟩) | ···

⟨inference⟩ ::= (rr1 ⟨rr1⟩ ⟨term⟩) | (rr2 ⟨rr2⟩ ⟨term⟩ ⟨term⟩)
              | (and ⟨item⟩*) | (or ⟨item⟩*) | (not ⟨item⟩)
              | (exists ⟨item⟩) | (nnf ⟨item⟩) | ···


Currently the (info) field only serves as an interface between the tool (which 
provides the certificate) and the certifier to compare the sizes of the constructed 
automata. In the future we plan to extend this field with concrete automata. 
This would make it possible to test language equivalence of a tree automaton computed by a tool
that supports our certificate language and the one reconstructed by FORTify, 
thereby providing tool authors with a mechanism to trace buggy constructions 
in case a certificate is rejected. 
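
Since certificates are sequences of parenthesized expressions, a tool author can read them with a generic S-expression reader. The Python sketch below is an illustration only and not the parser used in FORTify; the function name `parse_sexprs` is hypothetical and the numbers in the `size` entry are invented.

```python
import re

# A token is a parenthesis or a maximal run of other non-space characters.
TOKEN = re.compile(r"[()]|[^\s()]+")


def parse_sexprs(text):
    """Parse a string into a list of S-expressions (nested lists of atoms)."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(int(tok) if tok.isdigit() else tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


# Two inference lines and a final claim; the size values are made up.
cert = parse_sexprs("""
(0 (rr1 (nf (0)) 0) (rr1 (nf (0)) 0))
(1 (rr2 (step* (0)) 1 0) (rr2 (step* (0)) 1 0) (size 1 2 0))
(nonempty 1)
""")
steps = [entry for entry in cert if isinstance(entry[0], int)]
claim = cert[-1]
```

Each inference line becomes a list whose first element is the item number, followed by the inference, the formula, and any optional info entries.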
We revisit Example 1 to illustrate the construction of certificates. 


Example 4. The formula $\varphi = \forall s\,\exists t\,(s \to^* t \land \mathsf{NF}(t))$ expressing normalization
is rendered as $\varphi' = \forall\,\exists\,(1 \to_0^* 0 \land 0 \in \mathsf{NF}[0])$ in de Bruijn notation. Here 1 refers
to the variable s, the second and third occurrences of 0 refer to t, and the last
occurrence of 0 refers to the first (and only) TRS, which has index 0. We construct
the certificate bottom-up, to mimic the decision procedure. The first line is for
NF[0]:


(0 (rr1 (nf (0)) 0) (rr1 (nf (0)) 0))
The components can be read as follows: 


— ⟨item⟩ = 0 denotes the first step in our proof,

— ⟨inference⟩ = rr1 (nf (0)) 0 constructs the automaton that accepts the
normal forms and keeps track of the variable 0,

— ⟨formula⟩ = rr1 (nf (0)) 0 denotes the subformula $0 \in \mathsf{NF}[0]$; it is
satisfiable if and only if the automaton constructed using the description in
⟨inference⟩ is not empty.


The apparent redundancy will disappear when we continue. We proceed by
expressing the relation $\to_0^*$ and subsequently make sure that the second component
of $\to_0^*$ is in normal form:


(1 (rr2 (step* (0)) 1 0) (rr2 (step* (0)) 1 0))
(2 (and (1 0)) (and ((rr2 (step* (0)) 1 0) (rr1 (nf (0)) 0))))


Line 1 is similar to line 0. The inference step and 1 0 in line 2 constructs an RR$_2$
automaton that accepts the intersection of the relations modeled in lines 1 and
0. This automaton corresponds to $A_4$ in Example 1. The cylindrification step
from $A_1$ to $A_3$ in Example 1 is left implicit. We continue with the projection of
variable 0 and afterwards complement the resulting automaton. This is done by
an exists followed by a not inference step:

(3 (exists 2) (exists (and ((rr2 (step* (0)) 1 0)
                            (rr1 (nf (0)) 0)))))

(4 (not 3) (not (exists (and ((rr2 (step* (0)) 1 0)
                              (rr1 (nf (0)) 0))))))


The inference steps until this point describe the construction of $A_6$ in Example 1.
We complete the certificate by introducing the remaining operators:


(5 (exists 4) (exists (not (exists (and ((rr2 (step* (0)) 1 0)
                                         (rr1 (nf (0)) 0)))))))

(6 (not 5) (not (exists (not (exists (and ((rr2 (step* (0)) 1 0)
                                           (rr1 (nf (0)) 0))))))))

(7 (nnf 6) (forall (exists (and ((rr2 (step* (0)) 1 0)
                                 (rr1 (nf (0)) 0))))))

(nonempty 7) 


The nnf inference step does not modify the tree automaton computed in step
6 (which corresponds to $A_8$ in Example 1) but checks the equivalence of the
formula in line 6 with the one of line 7, which corresponds to the input formula
$\varphi'$. The equivalence check incorporates $\forall$ elimination, negation normal form,
and associativity, commutativity and idempotency of $\land$ and $\lor$. In the future
we might add support for additional equivalences in first-order logic. The final
step (nonempty 7) checks that $L(A_8) \neq \varnothing$. So this certificate claims that the
input TRS is normalizing. For TRSs that do not satisfy $\varphi$, the final line in the
certificate would be (empty 7).
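
The kind of equivalence check performed by an nnf step can be sketched in Python. This is an illustration under assumptions, not the verified check of FORTify: formulas are nested tuples, atoms are arbitrary hashable tuples, and the hypothetical function `canon` computes a canonical form in which negations sit on atoms and the arguments of conjunctions and disjunctions form a set.

```python
def canon(f, neg=False):
    """Canonical form modulo double negation, quantifier duality,
    De Morgan, and associativity/commutativity/idempotency of and/or."""
    tag = f[0]
    if tag == "not":
        return canon(f[1], not neg)
    if tag in ("and", "or"):
        dual = {"and": "or", "or": "and"}
        out = dual[tag] if neg else tag
        kids = set()
        for g in f[1:]:
            c = canon(g, neg)
            if c[0] == out:          # associativity: flatten nested operators
                kids |= c[1]
            else:
                kids.add(c)
        if len(kids) == 1:           # idempotency may leave a single argument
            return next(iter(kids))
        return (out, frozenset(kids))
    if tag in ("exists", "forall"):
        dual = {"exists": "forall", "forall": "exists"}
        return (dual[tag] if neg else tag, canon(f[1], neg))
    return ("not", f) if neg else f  # atom


# Atoms standing in for (rr2 (step* (0)) 1 0) and (rr1 (nf (0)) 0).
a = ("rr2", "step*(0)", 1, 0)
b = ("rr1", "nf(0)", 0)

line6 = ("not", ("exists", ("not", ("exists", ("and", a, b)))))
line7 = ("forall", ("exists", ("and", b, a)))   # arguments swapped on purpose
equivalent = canon(line6) == canon(line7)
```

Two formulas are accepted as equivalent when their canonical forms coincide; the sketch covers only the equivalences named above and is therefore incomplete for first-order logic in general.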


In the previous example we intentionally skipped over some details to convey 
the underlying intuition. First of all, the ⟨rr2⟩ construct (step* (0)) is derived
and internally unfolded via (anchored) GTTs into 


(gtt (gtc (root-step 0)) >= >) 


Starting from an anchored GTT that accepts the root step relation induced
by the first (and only) TRS in the list, an application of the GTT transitive
closure operation followed by a multi-hole context closure operation with at least
one hole that may appear in any position constructs an RR$_2$ automaton that
accepts the relation $\to_0^*$. We also mentioned that cylindrification is implicit.
The same holds for the projection operation that is used in the exists inference
steps. A projection takes place in the first component if the variable 0 is present
in the list of variables, otherwise the inference step preserves the automaton.
This approach is sound as variables indicate the relevant components of the RR$_n$
automaton. Thanks to the de Bruijn representation, the innermost quantifier
refers to variable 0, the first component in the given RR$_n$ automaton. However,
we must keep track of all variables occurring in the surrounding formula and
update that list accordingly.
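
The bookkeeping for an exists step can be modeled on finite relations. The Python sketch below is an illustration only: a set of tuples stands in for the language of an automaton, and the assumption that components are ordered by ascending de Bruijn index is ours, chosen to match the remark that variable 0 is the first component.

```python
def exists_step(variables, relation):
    """Model of an exists inference on a finite relation.

    variables: sorted de Bruijn indices, one per component of the tuples.
    relation:  a set of tuples standing in for the accepted language.
    """
    if variables and variables[0] == 0:
        relation = {row[1:] for row in relation}   # project away component 0
        variables = variables[1:]
    # The binder is gone, so every remaining index moves one step closer.
    return [v - 1 for v in variables], relation


# Variable 0 occurs: the first component is projected away.
vars2, rel2 = exists_step([0, 1], {("b", "a"), ("c", "a")})

# Variable 0 does not occur: the relation is preserved, indices shift.
vars3, rel3 = exists_step([1], {("a",)})
```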


5 FORTify 


The example in the preceding section makes clear that a certificate can be viewed 
as a recipe for the certifier to perform certain operations on automata and
formulas to confirm the final (non-)emptiness claim. In particular, checking a
certificate is expensive because the decision procedure for the first-order theory is
replayed using code-generated operations from a verified version of the decision 
procedure. In this section we describe the steps we performed to turn the Is- 
abelle formalization of the decision procedure described in [14] into our certifier 
FORTify. 

We use the FOL-Fitting library [3], which is part of the Archive of Formal 
Proofs,* to connect the first-order theory of rewriting and first-order logic. The 
translation is more or less straightforward. We interpret RR$_1$ constructions as
predicates and RR$_2$ constructions as relations in first-order logic and prove both
interpretations to be semantically equivalent: 


lemma eval_formula F Rs α f =
  eval α undefined (for_eval_rel F Rs) (form_of_formula f)


With this equivalence we are able to define the semantics of formulas:


definition formula_satisfiable where
  formula_satisfiable F Rs f ⟷ (∃α. range α ⊆ 𝒯_G F ∧
    eval_formula F Rs α f)


definition formula_unsatisfiable where
  formula_unsatisfiable F Rs fm ⟷ (formula_satisfiable F Rs fm = False)


definition correct_certificate where
  correct_certificate F Rs claim infs n =
    (claim = Empty ⟶ (formula_unsatisfiable (fset F) (map fset Rs)
      (fst (snd (snd (infs ! n))))) ∧
    claim = Nonempty ⟶ formula_satisfiable (fset F) (map fset Rs)
      (fst (snd (snd (infs ! n)))))


Last but not least we define the important function check_certificate which 
takes as input a signature, a list of TRSs, a boolean, a formula, and a certificate. 
This function first verifies that the given formula and the claim correspond to
the ones referenced in the certificate and afterwards checks the integrity of the 
certificate. The following lemmata, which are formally proved in Isabelle, state 
the correctness of the check_certificate function: 


lemma check_certificate F Rs A fm (Certificate infs claim n) = Some B
  ⟹ fm = fst (snd (snd (infs ! n))) ∧ A = (claim = Nonempty)


lemma check_certificate F Rs A fm (Certificate infs claim n) = Some B
  ⟹ (B = True ⟶ correct_certificate F Rs claim infs n) ∧
     (B = False ⟶ correct_certificate F Rs (case claim of
        Empty ⇒ Nonempty | Nonempty ⇒ Empty) infs n)


* https://www.isa-afp.org

The first lemma ensures that our check function verifies that the provided
parameters fm (formula) and A (answer satisfiable/unsatisfiable) match the formula
and the claim stated in the certificate. The second lemma is the key result. It
states that the check function returns Some True if and only if the certificate
is correct. The only-if case is hidden in the last two lines. More precisely, if the
claim of the certificate is wrong then negating the claim (the first-order theory
of rewriting is complete) leads to a correct certificate. Therefore, if our check
function returns Some False then the certificate is correct after negating the
claim.

Our check function returns None if the global assumptions (the input TRS is
linear variable-separated, the signature is not empty, etc.) are not fulfilled.
We plan to extend the check_certificate function in the near future such that
it reports these kinds of errors.
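
The three possible outcomes of the check function can be summarized in a small Python model. This is an illustration only: the function `verdict` and its message strings are hypothetical, and Python's `Optional[bool]` stands in for the option type of the Isabelle function.

```python
from typing import Optional


def verdict(result: Optional[bool], claim: str) -> str:
    """Interpret a check result for a claim that is 'empty' or 'nonempty'."""
    flipped = {"empty": "nonempty", "nonempty": "empty"}
    if result is None:
        # Global assumptions (e.g. linear variable-separated input) violated.
        return "rejected: global assumptions not fulfilled"
    if result:
        return f"certificate correct for claim {claim}"
    # Some False: the certificate is correct after negating the claim.
    return f"certificate correct for claim {flipped[claim]}"
```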

A central part of the formalization is to obtain a trustworthy decision pro- 
cedure to verify certificates. Hence we use the code generation facility of Is- 
abelle/HOL to produce an executable version of our check_certificate func- 
tion. Isabelle’s code generation facility is able to derive executable code for our 
constructions with the exception of inductively defined sets. In [8, Section 7] 
an abstract Horn inference system for finite sets is introduced to overcome this 
limitation. We use this framework to obtain executable code for the following 
constructions defined as inductive sets: 


— reachable and productive states of a tree automaton,

— states of tree automata obtained by the subset construction,

— epsilon transitions for the composition and transitive closure constructions
of (anchored) GTTs,

— an inductive set needed for the tree automaton for the infinity predicate.
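
To give an impression of how an inductively defined set becomes executable, the following Python sketch computes the reachable states of a tree automaton by naive saturation. It is not the Horn inference framework of [8]; the transition encoding and the function name are assumptions made for this illustration.

```python
def reachable_states(transitions):
    """Least set closed under: if all argument states of a transition
    are reachable, then its target state is reachable.

    transitions: list of (symbol, argument_states, target_state).
    """
    reached = set()
    changed = True
    while changed:                    # iterate until no rule adds a state
        changed = False
        for _symbol, args, target in transitions:
            if target not in reached and all(q in reached for q in args):
                reached.add(target)
                changed = True
    return reached


delta = [
    ("a", (), "q0"),
    ("g", ("q0",), "q1"),
    ("f", ("q1", "q2"), "q3"),   # q2 is never reached, so neither is q3
]
states = reachable_states(delta)
```

The loop terminates because the set of states is finite and grows in every round that continues.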


At this point we can use Isabelle’s code generation to obtain an executable check 
function. However, more effort is needed to obtain an efficient check function. 
Checking the certificate in Example 6 below did not terminate after more than 
24 hours computation time. We used the profiling capabilities of the Glasgow 
Haskell Compiler (GHC) to analyze the generated code. This revealed that most 
of the time was spent on checking membership. Since the computed tree au- 
tomata can grow very large, the use of lists as underlying data structure for sets 
in the generated code is a bottleneck. 

To overcome this problem we decided to use the container framework of 
Lochbihler [12]. In our case, the setup involved a non-trivial overhead as the 
container framework requires multiple class instances for data types used inside 
sets. Some of these instances could be derived automatically by the deriving 
framework of Sternagel and Thiemann [20]. Afterwards Isabelle’s code generation 
was able to generate a check_certificate function that uses red-black trees as 
underlying data structure for sets. 

Fig. 1. Certificate validation with FORTify.


Sadly, the function was still infeasible for the certificate in Example 6. This
time the power set construction, which is exponential in the worst case, turned out
to be the culprit. In this construction we compute the transitive closure of the
present epsilon transitions multiple times. Adding an explicit construction to
remove epsilon transitions from tree automata solved this issue. To make a long
story short, after further modifications we were able to verify the certificate for
Example 6 in a little less than 3 minutes, which we consider fast enough for a
first prototype. The resulting code-generated certifier is called FORTify.
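
Removing epsilon transitions before the power set construction means that their closure is computed only once. The Python sketch below illustrates the idea on a simple encoding of bottom-up tree automata; it is not the verified construction, and all names are hypothetical.

```python
def epsilon_closure(eps):
    """Map every state to the states reachable via epsilon transitions."""
    states = {p for p, _ in eps} | {q for _, q in eps}
    closure = {q: {q} for q in states}
    changed = True
    while changed:
        changed = False
        for p, q in eps:
            for r in closure:
                if p in closure[r] and not closure[q] <= closure[r]:
                    closure[r] |= closure[q]
                    changed = True
    return closure


def remove_epsilon(transitions, eps, finals):
    """Return an epsilon-free automaton accepting the same language.

    Every transition f(q1, ..., qn) -> q is copied to each state that q
    reaches via epsilon transitions; the final states stay the same.
    """
    closure = epsilon_closure(eps)
    new_transitions = {
        (f, args, q2)
        for (f, args, q) in transitions
        for q2 in closure.get(q, {q})
    }
    return new_transitions, set(finals)


delta, finals = remove_epsilon(
    {("a", (), "q0")}, [("q0", "q1"), ("q1", "q2")], {"q2"})
```

In the example the constant a is accepted only through two epsilon steps; after the transformation it reaches the final state q2 directly.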

The overall design of FORTify is shown in Figure 1. It can be viewed as two 
separate modules A and B. Module B is the verified Haskell code base that is gen- 
erated by Isabelle’s code generation facility, containing the check_certificate 
function and the data type declarations for formulas and certificates. To use 
this functionality, we wrote a parser which translates strings representing for- 
mulas (signatures, TRSs, certificates) to semantically equivalent formulas (sig- 
natures, TRSs, certificates) represented in the data types obtained from the 
generated code. This was done in Haskell and refers to module A in Figure 1. 
Module A accepts formulas in FORT syntax. Hence it also applies the con- 
version to the de Bruijn representation. After the translation in module A, the 
check_certificate function in module B is executed and its output is reported. 

Importantly, the code in module A is not verified in Isabelle. Correctness of 
FORTify must therefore assume correctness of module A as well as the correct- 
ness of the Glasgow Haskell Compiler, which we use to generate a standalone 
executable from the generated code. 


6 FORT-h 


FORT-h is a new decision tool for the first-order theory of rewriting. It is a
reimplementation of the decision mode of the previous FORT tool [18] based on 
a modified decision procedure. The decision procedure, like the formalization, 
is based on anchored GTTs. The new tool is implemented in Haskell whereas 
FORT is written in Java. 

FORT-h supports all features of FORT while extending the domain of sup- 
ported TRSs from left-linear right-ground TRSs to linear variable-separated 
ones. While FORT could technically take such TRSs as input, it is unsound 
when checking non-ground properties on them.

Fig. 2. Interface of FORT-h.


Example 5. To check confluence of the linear variable-separated TRS 


g(g(x)) → g(y)     a → g(a)

FORT-h can be called with 


> ./fort-h "CR" input.trs 
NO 


where input.trs is a text file containing the rewrite system. The tool correctly
answers NO: the system is not confluent. However, FORT incorrectly identifies
this as confluent due to the lack of support for variables appearing in right-hand
sides of rules.


FORT-h took part in the 2020 edition of the Confluence Competition, com- 
peting in five categories: COM, GCR, NFP, UNC and UNR. Even though it does 
not support many problems tested in the competition, due to the restriction to 
linear variable-separated TRSs, it was able to win the category for most YES 
results in UNR. The tool expects as input a formula y and one or more TRSs, as 
seen in Figure 2. It then outputs the answer YES or NO depending on whether 
y is satisfied or not by the given TRSs. FORT-h may be passed some additional 
options: 


-c FILE: causes FORT-h to write a certificate to the given FILE, 

-i: enables the additional (info) in the inference steps in the certificate, 

-v: enables verbose output (e.g. showing the internal formula representation). 
-w: enables witness generation. 


As an example of the latter, consider Example 5 and the call 


> fort-h -w "CR" input.trs 

NO 

formula body / witness: 
(0 (<- o ->*) 1 & ~ 0 (->* o *<-) 1) 
0 = g(_00()) 
1 = g(_01())


So in addition to the answer NO, it also outputs a counterexample for the given
formula consisting of the two terms g(_00()) and g(_01()). Here _00 and _01
are additional constants required to reduce confluence to ground-confluence, and
represent variables. The terms should therefore be read as g(x) and g(y).
Internally FORT-h represents formulas using de Bruijn indices as described 
in Section 4. Additionally, universal quantifiers and implications are eliminated, 
and negations are pushed as far as possible to the atomic subformulas. The 
tool then traverses the formula in a bottom-up fashion, constructing the corresponding
anchored GTTs and RR$_n$ automata. During this traversal we also keep
track of the steps taken, to construct the certificate if necessary. To improve 
performance the automata are cached and reused for equal subformulas. The 
tree automaton representing the whole formula is then checked for emptiness. If 
the accepted language is empty, FORT-h reports NO, otherwise it outputs YES. 
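
The bottom-up traversal with caching of equal subformulas can be sketched as follows. This Python code is an illustration only and not taken from FORT-h (which is written in Haskell): it emits certificate lines without constructing any automata, and the tuple encoding and the function name `emit_certificate` are assumptions.

```python
def emit_certificate(formula):
    """Return certificate lines for a formula given as nested tuples.

    Atoms are ("rr1", ...) or ("rr2", ...); compound formulas are
    ("and", f, g, ...), ("or", f, g, ...), ("not", f), ("exists", f).
    """
    lines, cache = [], {}

    def show(f):
        if isinstance(f, tuple):
            if f[0] in ("and", "or"):
                return f"({f[0]} ({' '.join(show(g) for g in f[1:])}))"
            return "(" + " ".join(show(g) for g in f) + ")"
        return str(f)

    def visit(f):
        if f in cache:                   # equal subformula: reuse its step
            return cache[f]
        tag = f[0]
        if tag in ("rr1", "rr2"):
            inference = show(f)
        elif tag in ("and", "or"):
            items = [visit(g) for g in f[1:]]
            inference = f"({tag} ({' '.join(map(str, items))}))"
        else:                            # "not" or "exists"
            inference = f"({tag} {visit(f[1])})"
        item = len(lines)
        lines.append(f"({item} {inference} {show(f)})")
        cache[f] = item
        return item

    visit(formula)
    return lines


step = ("rr2", ("step*", (0,)), 1, 0)
nf = ("rr1", ("nf", (0,)), 0)
lines = emit_certificate(("exists", ("and", step, nf)))
shared = emit_certificate(("or", step, ("not", step)))
```

In the second call the atom occurs twice but produces a single certificate line, which mirrors the reuse of cached automata for equal subformulas.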


7 Experiments 


The experiments described in this section were run on a computer with an Intel(R)
Core(TM) i7-5930K CPU with 6 cores at 3.50GHz. 

In the 2019 edition of the Confluence Competition [15] three tools contested 
the commutation (COM) category:⁵ ACP [1], CoLL [19], and FORT. On input
problem COPS #1118 the tools gave conflicting answers. 


Example 6. COPS #1118 is about the commutation of the TRSs COPS #669

    a → c     f(a) → b     b → b     b → h(b, h(c, a))

and COPS #695

    h(a, a) → c     b → h(b, a)     b → a     f(c) → c     c → a


To determine the correct answer we use FORT-h to produce a certificate for
ground-commutation by calling


> fort-h -c cert -i "GCom([0],[1])" 1118.trs 
YES 


This produces the following certificate: 


(0 (rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
   (rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
   (size 13 53 0))
(1 (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)
   (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)
   (size 11 47 0))
(2 (not 1) (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)))
(3 (and (0 2))
   (and ((rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
         (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)))))
(4 (exists 3)
   (exists (and ((rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
                 (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1))))))
(5 (exists 4)
   (exists (exists (and ((rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
                         (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)))))))
(6 (not 5)
   (not (exists (exists (and ((rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
                              (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1))))))))
(7 (nnf 6)
   (forall (forall (or ((not (rr2 (comp (inverse (step* (1))) (step* (0))) 0 1))
                        (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1))))))
(nonempty 7)


When passing this certificate to FORTify, after 2 minutes and 57 seconds the
output Certified is produced, so we can be assured that the TRSs do commute.
Note that the inference steps 0 and 1 contain the optional size information. Here
(size k m n) means the underlying RR$_n$ automaton constructed by FORT-h
contains k final states, m transitions, and n epsilon transitions.


5 https://cops.uibk.ac.at/results/?y=2019&c=COM


Table 1. FORT(-h) run on GCR formulas with a 60s timeout (FORTify with 600s).

              YES  ∅-time   ✓    NO  ∅-time   ✓   ∞   total (✓) time
(1)  FORT-h    36   0.26s  10    84   0.56s  16   2   176.23s (17.6h)
     FORT      37   0.31s   —    82   0.52s   —   3   234.08s
(2)  FORT-h    37   1.48s  10    84   0.09s  16   1   122.55s (17.8h)
     FORT      37   0.32s   —    82   0.50s   —   3   233.20s
(3)  FORT-h    36   0.45s   6    83   0.08s   9   3   202.64s (18.2h)
     FORT      37   0.32s   —    82   0.55s   —   3   236.69s


We also ran some experiments comparing FORT-h to FORT. The problems for
these experiments are taken from the Confluence Problems database (COPS),
and consist of 122 left-linear right-ground TRSs. Note that FORT-h implements
no parallelism, while FORT does. For the first two experiments we chose
a timeout of 60 seconds for the decision tools and 600 seconds for FORTify. The
formulas were taken from the experiments reported in [17]. The first three


$\forall s\,\forall t\,\forall u\,(s \to^* t \land s \to^* u \implies t \downarrow u)$     (1)
$\forall s\,\forall t\,\forall u\,(s \to^* t \land s \to u \implies t \downarrow u)$       (2)
$\forall t\,\forall u\,(t \leftrightarrow^* u \implies t \downarrow u)$                    (3)


denote different but equivalent formulations of ground-confluence (GCR).

The results are shown in Table 1, where the YES (NO) column shows the
number of systems determined to be (non-)ground-confluent together with the
average time (∅-time) the tool took. The ∞ column is the number of timeouts.
To compare overall performance the total time column contains the sum of all
runtimes, including timeouts but excluding the time taken by FORTify. The
✓ columns show the numbers of certifiable results as well as the overall time
taken by FORTify (✓-time). These results show that, even though they have the
same meaning, the choice of formula has an impact on performance. Interestingly
FORT-h is generally faster and can solve more problems than FORT even
though it cannot take advantage of any parallelism. This performance advantage
is more prominent in systems which are non-confluent. For problems with
the answer YES, FORT can still prove more. The table also shows that FORTify
can only certify a small portion of the results. This is due to the performance of
the certifier, since all other problems time out. It is also apparent that formulas
containing conversion ($\leftrightarrow^*$) are especially slow. No wrong results by the decision
tools were identified.
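
The relation between the columns can be checked with a short calculation. The Python sketch below is an illustration that recomputes the total time of one row of Table 1 from the rounded averages, so it can only be expected to agree approximately with the printed total; the function name is hypothetical.

```python
TIMEOUT = 60.0  # seconds, the decision-tool timeout stated for Table 1


def estimated_total(yes, avg_yes, no, avg_no, timeouts):
    """Sum of all runtimes, reconstructed from the rounded averages."""
    return yes * avg_yes + no * avg_no + timeouts * TIMEOUT


# Row (1), FORT-h, as printed in Table 1.
estimate = estimated_total(36, 0.26, 84, 0.56, 2)
problems = 36 + 84 + 2
```

The estimate lies within a fraction of a second of the reported 176.23s, and the three counts add up to the 122 systems of the data set.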

The second set of formulas represents the normal form property, restricted 
to ground terms (GNFP): 


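$\forall t\,\forall u\,(t \leftrightarrow^* u \land \mathsf{NF}(u) \implies t \to^* u)$           (4)
$\forall s\,\forall t\,\forall u\,(s \to t \land s \to^! u \implies t \to^* u)$            (5)
$\forall t\,(\mathsf{WN}(t) \implies \mathsf{CR}(t))$                                      (6)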


The results for these are shown in Table 2. The same pattern is observed, where 
even though both can (dis)prove satisfaction for the same formulas, FORT-h is 
faster overall. 

For the last experiment we test performance on properties over two TRSs. 
This is done by checking ground-commutation (GCOM) for all pairs of systems 
from the dataset, resulting in 7503 problems. A timeout of 60 seconds was used.
The results, presented in Table 3, show that FORT-h is ahead here as well, 
(dis)proving more problems and doing so in significantly less time. 
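
The figure of 7503 problems is consistent with counting unordered pairs of the 122 systems, where each system is also paired with itself. This reading is an assumption, since the text does not spell out how the pairs are formed:

```latex
\[
  \binom{122}{2} + 122 \;=\; \frac{122 \cdot 121}{2} + 122 \;=\; 7381 + 122 \;=\; 7503
\]
```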

Full details of the experiments are available from the website⁶ accompanying
this paper. Precompiled binaries of FORT-h and FORTify are available from the
same site. We also present a few additional experiments with FORTify.


6 https://fortissimo.uibk.ac.at/tacas2021


Table 2. FORT(-h) run on GNFP formulas with a 60s timeout (FORTify with 600s).

              YES  ∅-time   ✓    NO  ∅-time   ✓   ∞   total (✓) time
(4)  FORT-h    59   0.70s  31    63   0.07s  20   0    45.62s (14.6h)
     FORT      59   0.23s   —    63   0.39s   —   0    38.16s
(5)  FORT-h    59   0.03s  46    63   0.01s  50   0     2.55s (6.3h)
     FORT      59   0.22s   —    63   0.30s   —   0    31.83s
(6)  FORT-h    59   0.05s  42    62   0.12s  45   1    70.51s (8.6h)
     FORT      59   0.31s   —    62   0.64s   —   1   117.86s


Table 3. FORT(-h) run on GCOM with a 60s timeout (FORTify with 600s).

              YES  ∅-time    ✓     NO  ∅-time     ✓   ∞   total (✓) time
     FORT-h  1381   0.16s  878   6120   0.03s  3666   2     517.32s (681.5h)
     FORT    1354   1.46s    —   6100   0.94s     —  49   10670.89s


8 Conclusion 


In this paper we presented FORTify, a certifier for the first-order theory of rewrit- 
ing for linear variable-separated TRSs, together with an expressive certificate 
language for formulas and proofs. Moreover, a new implementation of the de- 
cision procedure for the theory of rewriting, FORT-h, is capable of producing 
certificates in this language. 

We mention three topics which require further research. First of all, many 
certificates produced by FORT-h cannot be validated by the current version of 
FORTify within reasonable time. We will further improve the algorithms and 
data structures used in the check_certificate function. A natural candidate
for optimization is the transitive closure algorithm generated by Isabelle, which 
always takes cubic time. Currently, sharing only takes place in the inference 
rules. Expanding this to the individual constructions will be the next step. Also 
trimming of anchored GTTs could improve the run time. In the current state of 
the formalization only trimming of GTTs is proved to be sound. Profiling will 
be used to determine other candidates that are likely to have a large impact on 
the validation time. 

A second topic for future research is the certification of properties on open 
(i.e., non-ground) terms. In [8, 16, 18] conditions are presented to reduce
properties related to confluence to the corresponding properties on ground terms, by
adding additional constants to the signature. These results need to be formalized 
in Isabelle and the certificate language needs to be extended, before FORTify can 
be used to certify the corresponding categories in the Confluence Competition. 
We plan to define signature extensions directly in formulas, to offer the most 
flexibility. A related issue is the support for many-sorted signatures in the Is- 
abelle formalization. FORT-h already supports many-sorted TRSs, which is the 
format in the GCR category of CoCo. 

A third topic is improving the efficiency of FORT-h. We anticipate that sup- 
porting parallelism will further speed up FORT-h, especially for large formulas. 
Preprocessing techniques that go beyond the mere transformation to negation 
normal form will be helpful to obtain equivalent formulas that reduce the size 
of the ensuing tree automata in the decision procedure. In [10] similar ideas are 
applied to WSKS, in connection with MONA [11]. 


Acknowledgments. We thank René Thiemann for giving valuable advice on how 
to improve the efficiency of the generated code. The comments by the anonymous 
reviewers improved the presentation.


Abstract. This paper presents a novel approach for quantifier instantiation
in Satisfiability Modulo Theories (SMT) that leverages syntax-guided
synthesis (SyGuS) to choose instantiation terms. It targets quantified
constraints over background theories such as (non)linear integer, real, and
floating-point arithmetic, bit-vectors, and their combinations.
Unlike previous approaches for quantifier instantiation in these domains
which rely on theory-specific strategies, the new approach can be applied
to any (combined) theory, when provided with a grammar for instantiation
terms for all sorts in the theory. We implement syntax-guided instantiation
in the SMT solver CVC4, leveraging its support for enumerative
SyGuS. Our experiments demonstrate the versatility of the approach,
showing that it is competitive with or exceeds the performance of
state-of-the-art solvers on a range of background theories.


1 Introduction 


Modern Satisfiability Modulo Theories (SMT) solvers are highly efficient tools, 
capable of reasoning about constraints over a wide range of logical theories, 
including (non-linear) real and integer arithmetic, fixed-size bit-vectors, and 
floating-point arithmetic. Their core algorithms are designed primarily for quan- 
tifier-free constraints, but various extensions have been shown to work well also 
for quantified constraints in many cases. Quantified reasoning in SMT has many 
practical applications, including software verification, automated theorem prov- 
ing, and synthesis. 

Current SMT solvers handle quantified constraints in a variety of ways, with 
a degree of effectiveness that usually depends on the background theory. For 
instance heuristic instantiation techniques such as E-matching [15] are used for 
quantified formulas with heavy use of uninterpreted functions. These heuristic 
instantiation techniques are refutationally incomplete but they can be highly 
effective, in particular in the context of verification applications. For quantified 
constraints over a particular background theory, such as linear arithmetic or 
fixed-size bit-vectors, on the other hand, SMT solvers resort to an entirely
different set of techniques. While also based on quantifier instantiation, these other
techniques tend to be counterexample-guided and can be complete for theories
and fragments of first-order logic that admit quantifier elimination.


* This work was supported in part by DARPA (award no. FA8650-18-2-7861), NSF
(award no. 1656926) and ONR (award no. N68335-17-C-0558).

Specific previous work in the latter direction includes counterexample-guided 
quantifier instantiation techniques for linear arithmetic [25] and fixed-size bit- 
vectors [18,20]. The key to developing each of them is to devise an appropriate, 
theory-specific selection function, which determines a term selection strategy for 
instantiating universal quantifiers. For some logics, e.g., linear arithmetic, se- 
lection functions can be based on the notion of elimination set found in classic 
algorithms for quantifier elimination [9,14]. However, since many theories used 
in practice do not admit quantifier elimination, the design of a good selection 
function is usually non-trivial. These challenges are further magnified when rea- 
soning in combinations of multiple theories. 

We propose a novel, syntax-guided quantifier instantiation (SyQI) approach,
which is both general-purpose and highly effective for quantified formulas in 
background theories such as (non)linear integer, reals and floating-point arith- 
metic, and their combinations. The new approach leverages an embedding of a 
solver for the syntax-guided synthesis (SyGuS) problem [1] within an SMT solver 
in order to choose terms for quantifier instantiation in a counterexample-guided 
manner. It is theory-agnostic and only requires the specification, via a grammar, 
of the set of terms to consider for each sort in the theory when instantiating 
quantifiers.³ Since it can be applied to quantified formulas in any background theory, 
it is more general in scope than previous work [20]. Our approach is intended 
for logics such as quantified floating-point arithmetic, which would benefit from 
counterexample-guided quantifier instantiation, but for which appropriate 
selection functions are not obvious. We show that the use of syntax-guided synthesis 
gives us the flexibility to develop variants of our approach that are highly com- 
petitive with the state of the art in SMT solving. More specifically, this paper 
makes the following contributions: 


— We present and prove correct a simple yet novel quantifier instantiation 
approach that leverages syntax-guided synthesis for selecting instantiations. 

— We explore variants of the approach along several dimensions, including the 
choice of symbols to include in grammars for various background theories. 

— We implement this technique in the SMT solver CVC4 [5] and show that 
it performs remarkably well in a wide variety of SMT logics. In particular, 
it improves upon the state of the art for solving quantified formulas over 
floating-point arithmetic, and is highly competitive for non-linear integer 
arithmetic and certain combined logics that involve fixed-size bit-vectors. 


Related Work. Handling quantified formulas in SMT solvers is a long-standing 
challenge. Early approaches for quantified formulas were largely based on 
E-matching [8,10,15]. They were later supplemented with techniques that 
rely on models for establishing satisfiability [11,26], and on conflict finding to 
accelerate the search for unsatisfiability [27]. Pragmatic enumerative approaches 


3 Our implementation provides a default grammar for all supported sorts. In general, 
grammars can also be provided by the user. We do not explore this option here. 
for quantifier instantiation have also been explored and shown to increase the 
precision of SMT solvers on inputs involving uninterpreted functions where E- 
matching is incomplete [21]. The approach we describe here is also enumerative in 
nature; however, it leverages syntax-guided synthesis for choosing instantiations 
and does not target inputs with uninterpreted functions. 

For quantified formulas over a single background theory, counterexample- 
guided approaches have been considered by Bjørner and Janota [6] and by 
Reynolds et al. [25], targeting primarily quantified linear integer/real arithmetic. 
For theories of other data types (e.g., bit-vectors), most approaches use value- 
based instantiation, where concrete variable assignments for a set of quantifier- 
free formulas derived from the negation of the input formula (the counterexam- 
ples) provide instantiations for the universal variables. In the SMT solver Z3 [16], 
model-based quantifier instantiation (MBQI) [11] is combined with a template- 
based model finding procedure [29]. A recent line of work by Niemetz et al. [18] 
leverages invertibility conditions in a counterexample-guided loop for quantifier 
instantiation of formulas in the theory of fixed-size bit-vectors. Brain et al. [7] lifted 
the concept of invertibility conditions to the theory of floating-point arithmetic 
and presented a preliminary quantifier elimination procedure for a fragment of 
the theory based on these conditions. Another approach for lazy quantifier elim- 
ination for bit-vector formulas is explored by Vediramana Krishnan et al. [12], 
based on iterative approximate quantifier elimination. 

Reynolds et al. [24] leverage counterexample-guided quantifier instantiation 
(CEGQI) to efficiently solve a restricted but practically useful form of syntax- 
guided synthesis problems. In contrast, the work in this paper has the dual goal 
of leveraging enumerative syntax-guided synthesis to establish a strategy for 
quantifier instantiation of (first-order) quantified formulas. 

SyGuS techniques to solve quantified problems were previously explored by 
Preiner et al. in [20]. However, instead of focusing on quantifier instantiation 
they combined enumerative syntax-guided synthesis with value-based quantifier 
instantiation to synthesize Skolem functions for existential variables. 


2 Background 


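We assume the usual notions and terminology of many-sorted first-order logic 
with equality (denoted by $\approx$). Let $S$ be a set of sort symbols. For every $\sigma \in S$, 
let $X_\sigma$ be an infinite set of variables of sort $\sigma$. Let $X = \bigcup_{\sigma \in S} X_\sigma$. 
Let $\Sigma$ be a signature consisting of a set $\Sigma^s \subseteq S$ of sort symbols and a set $\Sigma^f$ 
of interpreted (and sorted) function symbols $f^{\sigma_1 \cdots \sigma_n \sigma}$ with arity $n \geq 0$ 
and $\sigma_1, \ldots, \sigma_n, \sigma \in \Sigma^s$. 
We assume that $\Sigma$ includes a Boolean sort Bool and the Boolean values $\top$ (true) 
and $\bot$ (false). Let $\mathcal{I}$ be a $\Sigma$-interpretation that maps: each sort $\sigma \in \Sigma^s$ to a 
non-empty set $\sigma^{\mathcal{I}}$ (the domain of $\mathcal{I}$), with $\mathrm{Bool}^{\mathcal{I}} = \{\top, \bot\}$; each variable 
$x \in X_\sigma$ to an element $x^{\mathcal{I}} \in \sigma^{\mathcal{I}}$; and each function $f^{\sigma_1 \cdots \sigma_n \sigma} \in \Sigma^f$ to a total 
function $f^{\mathcal{I}} : \sigma_1^{\mathcal{I}} \times \ldots \times \sigma_n^{\mathcal{I}} \to \sigma^{\mathcal{I}}$ if $n > 0$, and to an element in $\sigma^{\mathcal{I}}$ if $n = 0$. 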

We assume the usual definition of well-sorted terms, literals, and formulas 
as Bool terms with variables in $X$ and symbols in $\Sigma$, and refer to them as 
$\Sigma$-terms, $\Sigma$-atoms, and so on. A ground term/formula is a $\Sigma$-term/formula without 
variables. We define $\mathbf{x} = (x_1, \ldots, x_n)$ as a tuple of variables and write $Q\mathbf{x}.\, \varphi$ with 
$Q \in \{\forall, \exists\}$ for a quantified formula $Qx_1 \cdots Qx_n.\, \varphi$. A formula is universal if 
it has the form $\forall \mathbf{x}.\, P$ where $P$ is a quantifier-free formula. For simplicity, we 
consider only universal quantifiers since existential quantifiers can be rewritten 
in terms of universal ones. We use $\mathrm{Lit}(\varphi)$ to denote the set of $\Sigma$-literals of 
$\Sigma$-formula $\varphi$. For a $\Sigma$-term or $\Sigma$-formula $e$, we use $e[\mathbf{x}]$ to indicate that the free 
variables of $e$ are in $\mathbf{x}$. For a tuple of $\Sigma$-terms $\mathbf{t} = (t_1, \ldots, t_n)$, we write $e[\mathbf{t}]$ for the 
term or formula obtained from $e$ by simultaneously replacing each occurrence 
of $x_i$ in $e$ by $t_i$. If $t$ is a $\Sigma$-term/formula and $\mathcal{I}$ a $\Sigma$-interpretation, we write 
$t^{\mathcal{I}}$ to denote the meaning of $t$ in $\mathcal{I}$. We use the usual inductive definition of a 
satisfiability relation $\models$ between $\Sigma$-interpretations and $\Sigma$-formulas. 

A theory $T$ is a pair $(\Sigma, \mathbf{I})$, where $\Sigma$ is a signature and $\mathbf{I}$ is a non-empty class 
of $\Sigma$-interpretations (the models of $T$) that is closed under variable reassignment, 
i.e., every $\Sigma$-interpretation that only differs from an $\mathcal{I} \in \mathbf{I}$ in how it interprets 
variables is also in $\mathbf{I}$. A $\Sigma$-formula $\varphi$ is $T$-satisfiable (resp. $T$-unsatisfiable) if it 
is satisfied by some (resp. no) interpretation in $\mathbf{I}$; it is $T$-valid if it is satisfied 
by all interpretations in $\mathbf{I}$. 


Enumerative SyGuS using an Embedding into Datatypes. A syntax-guided 
synthesis problem for an $n$-ary function $f$ in a background theory $T$ consists of 
a set of semantic restrictions (a specification) for $f$, given as a (second-order) 
$T$-formula of the form $\exists f.\, \varphi[f]$, and a set of syntactic restrictions on the 
solutions for $f$, typically expressed as a context-free grammar. A solution to such 
a problem is a term $t[x_1, \ldots, x_n]$ that satisfies the syntactic restrictions and is 
such that the formula $\varphi[\lambda x_1, \ldots, x_n.\, t]$ is $T$-valid. 

As shown in previous work [24], syntactic restrictions for the bodies of 
functions to synthesize can be conveniently represented as a set of (algebraic) 
datatypes. The setting in this paper is simpler. Instead of synthesizing terms 
corresponding to function bodies, we use context-free grammars for defining a set 
of (first-order) terms in a given theory, possibly containing free function 
symbols. For instance, let $a$ and $b$ be free constants of sort Int. The context-free 
grammar $R$ below specifies a set of integer ($Z$) and Boolean ($B$) terms: 

    $Z := 0 \mid 1 \mid a \mid b \mid Z + Z \mid Z - Z \mid \mathrm{ite}(B, Z, Z)$    (1) 

    $B := Z \geq Z \mid Z \approx Z \mid \neg B \mid B \land B$    (2) 

Given such a grammar, our SyGuS solver generates the following mutually 
recursive datatypes: 

    $\mathcal{Z} = \mathsf{zero} \mid \mathsf{one} \mid \mathsf{a} \mid \mathsf{b} \mid \mathsf{plus}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{minus}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{ite}(\mathcal{B}, \mathcal{Z}, \mathcal{Z})$    (3) 

    $\mathcal{B} = \mathsf{geq}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{eq}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{not}(\mathcal{B}) \mid \mathsf{and}(\mathcal{B}, \mathcal{B})$    (4) 

Each datatype constructor, listed on the right-hand side of each equation, 
corresponds to a production rule of $R$, e.g., $\mathsf{plus}$ corresponds to the rule $Z := Z + Z$. 
Given a datatype value $v$, we write to_term($v$) to denote the term that $v$ 
represents, e.g., to_term($\mathsf{plus}(\mathsf{a}, \mathsf{b})$) is the term $a + b$. 
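
The correspondence between datatype values and terms can be made concrete with a small Python sketch. It is an illustration only: the nested-tuple encoding of datatype values, the string rendering of terms, and the tables INFIX and LEAF are assumptions of this sketch and do not reflect how CVC4 represents datatypes internally.

```python
# Datatype values are modelled as nested tuples: (constructor, child, ...).
# Constructor names mirror datatypes (3) and (4); the encoding is hypothetical.
INFIX = {"plus": "+", "minus": "-", "geq": ">=", "eq": "=", "and": "and"}
LEAF = {"zero": "0", "one": "1", "a": "a", "b": "b"}


def to_term(v):
    """Return (as a string) the term that datatype value v represents."""
    head, *kids = v
    if head in LEAF:
        return LEAF[head]
    if head in INFIX:
        left, right = (to_term(k) for k in kids)
        return f"({left} {INFIX[head]} {right})"
    if head == "not":
        return f"(not {to_term(kids[0])})"
    if head == "ite":
        cond, then, other = (to_term(k) for k in kids)
        return f"ite({cond}, {then}, {other})"
    raise ValueError(f"unknown constructor: {head}")


print(to_term(("plus", ("a",), ("b",))))  # (a + b)
```

Each constructor maps to exactly one production rule, so the translation is a plain structural recursion over the datatype value.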
In previous work [22,24], a smart enumerative approach for syntax-guided 
synthesis was presented and implemented in CVC4. In that work, the generation 
of terms is based on finding solutions for an evolving set of constraints in an 
extension of the quantifier-free fragment of algebraic datatypes, for which some 
SMT solvers have dedicated decision procedures [3,23]. In the remainder of 
this paper, we write $T_D$ to denote the theory of datatypes over a signature $\Sigma_D$ 
of constructor and selector symbols. The signature $\Sigma_D$ includes (parametric) 
datatype sorts that are interpreted as the universe of a term algebra over the 
constructors. Selectors are interpreted as functions that extract the immediate 
subterms of a constructor term. 

In our setting, datatype constraints are used to express syntactic restrictions 
on the terms in the original theory. For instance, in the case of the example theory 
and corresponding datatypes $\mathcal{Z}$ and $\mathcal{B}$ defined above, we can write a datatype 
constraint that is falsified by all terms of the form $\mathsf{plus}(\mathsf{zero}, t)$ where $t$ is a 
constructor term of sort $\mathcal{Z}$. This corresponds to ruling out terms of the form 
$(0 + s)$ in the original theory where $s$ is a term of sort Int. In more detail, for a 
datatype term $d$, we write $\mathrm{is}_C(d)$ to denote the discriminator predicate, which is 
satisfied exactly when $d$ is interpreted as a datatype value whose top constructor 
is $C$. We write $\mathrm{sel}_{\sigma,n}(d)$ to denote a shared selector [28] applied to $d$, interpreted 
as the $n$-th child of $d$ with sort $\sigma$ if one exists, and as an arbitrary element of 
$\sigma$ otherwise. These symbols are used for constructing blocking constraints. For 
example, we can write $\neg \mathrm{is}_{\mathsf{plus}}(d) \lor \neg \mathrm{is}_{\mathsf{zero}}(\mathrm{sel}_{\mathcal{Z},1}(d))$ to state the constraint above 
that $d$ cannot be interpreted as any datatype value corresponding to an Int term 
of the form $(0 + \ldots)$. In the context of syntax-guided synthesis, a constraint like 
this is added, for instance, to filter out redundant terms (like $0 + \ldots$) or terms 
already known to falsify the synthesis conjecture. 
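
To illustrate how discriminators and selectors combine into a blocking constraint, the following Python sketch evaluates the constraint above on concrete datatype values. The tuple encoding and the helper names is_ctor, sel, and violates_blocking_constraint are assumptions of this sketch; in particular, the sort index of the shared selector is omitted and the "arbitrary element" returned for a missing child is fixed to zero.

```python
def is_ctor(ctor, d):
    """Discriminator is_C(d): true iff the top constructor of d is ctor."""
    return d[0] == ctor


def sel(d, n, default=("zero",)):
    """Selector: the n-th child of d (1-based), or an arbitrary default value."""
    return d[n] if n < len(d) else default


def violates_blocking_constraint(d):
    """True iff d falsifies  not is_plus(d)  or  not is_zero(sel(d, 1))."""
    constraint = (not is_ctor("plus", d)) or (not is_ctor("zero", sel(d, 1)))
    return not constraint


print(violates_blocking_constraint(("plus", ("zero",), ("a",))))  # True
print(violates_blocking_constraint(("plus", ("a",), ("zero",))))  # False
```

Only values whose top constructor is plus and whose first child is zero are ruled out; every other value, including the constant zero itself, satisfies the constraint.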

Our approach for syntax-guided instantiation relies on a notion of evaluation 
variables. A related, more general, notion of evaluation functions was used in 
the context of syntax-guided synthesis (see Section 2 of [22] for details). Let $d$ 
be a term of a datatype sort encoding a grammar over terms of sort $\sigma$. We write 
$e_d$ to denote a free constant of sort $\sigma$, which we call the evaluation variable for 
$d$. We use evaluation variables to determine which terms to use in instantiations 
of quantified formulas. The algorithm given in the following section will add 
constraints that force the interpretation of $e_d$ to be equal to to_term($d^{\mathcal{I}}$) in 
interpretations $\mathcal{I}$. A simple example of such a constraint is $\mathrm{is}_{\mathsf{a}}(d) \Rightarrow e_d \approx a$, 
stating that the evaluation variable $e_d$ for $d$ is equal to the free constant $a$ of 
integer type when $d$ is interpreted as the datatype value $\mathsf{a}$. 


3 SyGuS Quantifier Instantiation (SyQI) 


Our new SyGuS-based instantiation approach combines counterexample-guided 
quantifier instantiation (CEGQI) with smart enumerative SyGuS techniques to 
synthesize terms for quantifier instantiation. In essence, it is an algorithm that 
tries to synthesize a term $t$ for a variable $x$ in a given formula $\forall x.\, P[x]$ such that 
$\neg P[t]$ holds. For synthesis purposes, each quantified variable is associated with 
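Algorithm 1 Main algorithms of the SyQI approach. 

 1: procedure syqi($\{Q_1, \ldots, Q_n\}$, $G$) 
 2:     for $Q_i \in \{Q_1, \ldots, Q_n\}$ with $Q_i = \forall \mathbf{x}.\, P[\mathbf{x}]$ do 
 3:         for $x \in \mathbf{x}$ do 
 4:             Let $d_x$ be a fresh global constant of datatype sort grammar$_{\mathcal{S}}(x)$ 
 5:         $G := G \cup \{l_i \Rightarrow \neg P[e_{d_{\mathbf{x}}}]\}$ with fresh Boolean constant $l_i$ and fresh $e_{d_{\mathbf{x}}}$ 
 6:     repeat 
 7:         if check($G$) = unsat then return unsat 
 8:         $r, \mathcal{I}$ := check($G \land (l_1 \lor \ldots \lor l_n)$) 
 9:         if $r$ = unsat then return sat 
10:         for $l_i \in \{l_1, \ldots, l_n\}$ such that $l_i^{\mathcal{I}} = \top$ do 
11:             $G := G\ \cup$ select_lemmas$_{\mathcal{L}}$($Q_i$, $\mathcal{I}$) 

12: procedure select_lemmas$_{\mathcal{L}}$($\forall x_1, \ldots, x_p.\, P[x_1, \ldots, x_p]$, $\mathcal{I}$) 
13:     $L := \emptyset$ 
14:     for $x_i \in \{x_1, \ldots, x_p\}$ do 
15:         $t_i$ := to_term($d_{x_i}^{\mathcal{I}}$) 
16:         $L := L \cup \{$explain($d_{x_i} \approx d_{x_i}^{\mathcal{I}}$) $\Rightarrow e_{d_{x_i}} \approx$ to_term($d_{x_i}^{\mathcal{I}}$)$\}$ 
17:     return non-empty subset of $\{P[t_1, \ldots, t_p]\} \cup L$ based on selection strategy $\mathcal{L}$ 
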


a SyGuS grammar based on the sort of the variable. For example, our algorithm 
uses a bit-vector-specific grammar to synthesize bit-vector terms as possible in- 
stantiations of quantified variables of bit-vector sort. Our SyGuS solver suggests 
instantiations based on such grammars and an evolving set of constraints on 
the instance term. The main advantage of this instantiation approach is that 
it does not require theory-specific quantifier instantiation algorithms. Its only 
theory-specific aspects are the construction of the grammar for each theory sort 
and the satisfiability checks performed on the generated instances. 

Algorithm 1 shows the two main procedures syqi and select_lemmas$_{\mathcal{L}}$ of 
our SyGuS instantiation approach. To simplify the exposition, we describe the 
restricted case where the quantified input formulas are all universal. Our 
implementation in CVC4, however, applies to the general case through a lazy 
conversion to DNF and resolution of quantifier alternations. 

Procedure syqi takes as argument a set $\{Q_1, \ldots, Q_n\}$ of universal 
(quantified) $T$-formulas and a set $G$ of ground $T$-formulas. As an initial step, and prior 
to solving the problem, we generate a lemma for each quantified formula $Q_i$ as 
part of our counterexample-guided quantifier instantiation approach (lines 2-5). 
We first create a fresh datatype constant $d_x$ of sort grammar$_{\mathcal{S}}(x)$ for each 
variable $x \in \mathbf{x}$ in each input formula $\forall \mathbf{x}.\, P[\mathbf{x}]$. The datatype sort grammar$_{\mathcal{S}}(x)$ 
is constructed from a SyGuS grammar determined by the sort of variable $x$. 
The language generated by the grammar includes ground terms from $Q_i$ and 
$G$ of the same sort. These terms are chosen following a selection strategy $\mathcal{S}$, 
which we describe in Section 3.1. Apart from running check, used as a black 
box, grammar$_{\mathcal{S}}$ implements the only theory-specific handling of our procedure. 
Finally, we add to $G$ a lemma of the form $l_i \Rightarrow \neg P[e_{d_{\mathbf{x}}}]$ for each quantified 
formula, where $l_i$ is a fresh Boolean constant (the counterexample literal for $Q_i$). 
Thanks to $l_i$ being fresh, this preserves the satisfiability of $G$. The notation $e_{d_{\mathbf{x}}}$ 
is a shorthand for $(e_{d_{x_1}}, \ldots, e_{d_{x_n}})$, the tuple of evaluation variables for each $d_x$ 
of $x \in \mathbf{x}$. The purpose of a counterexample lemma is twofold. First, it indicates 
whether a quantified formula $Q_i$ is active ($l_i$ assigned to true) or inactive ($l_i$ 
assigned to false). Second, it focuses on finding counterexamples that falsify the 
body of $Q_i$. 

The main loop of procedure syqi is provided in lines 6-11. Each iteration 
starts with a quantifier-free satisfiability check (performed by procedure check 
on line 7) on the current set of ground formulas $G$ in the combined theory 
$T \cup T_D$. If $G$ is unsatisfiable, procedure syqi returns unsat. If $G$ is satisfiable, 
the procedure further checks whether it can find a counterexample for any of the 
quantified formulas $Q_1, \ldots, Q_n$, which is done by checking the satisfiability of 
$G \land (l_1 \lor \ldots \lor l_n)$. If the check returns unsat then no more counterexamples can be 
found; the algorithm concludes that the input set is satisfiable and returns sat. The 
reason is that, in this case, the set $G$ is satisfiable and entails each input formula, 
as proven later in this section. If the second call to check (line 8) returns sat, it 
additionally returns (a finite representation of) a model $\mathcal{I}$ for the current set of 
ground formulas $G$. Since $\mathcal{I}$ satisfies $l_1 \lor \ldots \lor l_n$, it does not satisfy at least one 
quantified formula in $Q_1, \ldots, Q_n$.⁴ For each active quantified formula in $\mathcal{I}$, we 
generate new lemmas via procedure select_lemmas$_{\mathcal{L}}$ (lines 10-11), and repeat 
the main loop of the algorithm. Note that the second satisfiability check can be 
avoided by employing a special decision heuristic for counterexample literals $l_i$ 
in the SAT solver. The decision heuristic will always assign a counterexample 
literal $l_i$ to true on a decision. Consequently, $l_i$ can only be assigned to false in 
a candidate interpretation $\mathcal{I}$ if $\neg l_i$ is entailed by the set of ground formulas $G$. 

Procedure select_lemmas$_{\mathcal{L}}$ takes a formula $\forall \mathbf{x}.\, P[\mathbf{x}]$ and a model $\mathcal{I}$ as 
arguments and generates a set of lemmas based on $\mathcal{I}$ and selection strategy $\mathcal{L}$. 
The procedure maintains the invariant of always returning a set of lemmas $L$ 
where $L \setminus G$ is non-empty. This set $L$ includes a single instantiation lemma (of 
the form $P[\mathbf{t}]$) and an evaluation unfolding lemma (see below) for each variable 
$x \in \mathbf{x}$. The returned lemmas are generated based on one of three lemma 
selection strategies: priority-inst, priority-eval, and interleave. Strategy interleave selects 
both the instantiation lemma and a set of evaluation unfolding lemmas at the 
same time. Strategies priority-inst and priority-eval give priority to instantiation 
lemmas and evaluation unfolding lemmas, respectively; i.e., strategy priority-inst 
selects the instantiation lemma and only selects evaluation unfolding lemmas if 
the instantiation lemma was already in $G$. Analogously, priority-eval gives priority 
to evaluation unfolding lemmas. 
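
The three lemma selection strategies can be summarized by the following Python sketch. It is a minimal reading of the description above under stated assumptions: lemmas are plain strings, G is a set of strings, and the function name choose_lemmas and its signature are hypothetical rather than taken from the CVC4 implementation.

```python
def choose_lemmas(inst_lemma, eval_lemmas, G, strategy):
    """Pick the lemmas to add to G according to the selection strategy."""
    new_eval = [lemma for lemma in eval_lemmas if lemma not in G]
    if strategy == "interleave":
        # Instantiation lemma and evaluation unfolding lemmas together.
        return [inst_lemma] + list(eval_lemmas)
    if strategy == "priority-inst":
        # Evaluation unfolding lemmas only if the instantiation is already in G.
        return [inst_lemma] if inst_lemma not in G else new_eval
    if strategy == "priority-eval":
        # Instantiation lemma only if no evaluation unfolding lemma is new.
        return new_eval if new_eval else [inst_lemma]
    raise ValueError(f"unknown strategy: {strategy}")


G = {"P[0]"}
print(choose_lemmas("P[0]", ["is_zero(d) => e_d = 0"], G, "priority-inst"))
```

In the call above the instantiation lemma is already in G, so priority-inst falls back to the evaluation unfolding lemma.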

The various lemmas are constructed as follows. For each variable $x \in \mathbf{x}$ we 
use the model value $d_x^{\mathcal{I}}$ of datatype constant $d_x$ to construct the corresponding 
term to_term($d_x^{\mathcal{I}}$) in the theory of variable $x$ (line 15). The constructed term 
corresponds to a term synthesized by the SyGuS extension of our datatypes 


4 Note that this does not mean the quantified formula is unsatisfiable, only that it is 
not satisfied in Z. 
solver based on the grammar specified for $x$. To ensure that $e_{d_x}$ evaluates to 
the same value as term to_term($d_x^{\mathcal{I}}$) under model value $d_x^{\mathcal{I}}$, we generate the 
evaluation unfolding lemma explain($d_x \approx d_x^{\mathcal{I}}$) $\Rightarrow e_{d_x} \approx$ to_term($d_x^{\mathcal{I}}$). The 
explanation for the model value $d_x^{\mathcal{I}}$ is expressed in terms of discriminator 
predicates. For example, if value $d_x^{\mathcal{I}}$ represents term $a + b$, the procedure generates 
lemma $\mathrm{is}_{\mathsf{plus}}(d_x) \land \mathrm{is}_{\mathsf{a}}(\mathrm{sel}_{\mathcal{Z},1}(d_x)) \land \mathrm{is}_{\mathsf{b}}(\mathrm{sel}_{\mathcal{Z},2}(d_x)) \Rightarrow e_{d_x} \approx a + b$. As a last 
step, select_lemmas$_{\mathcal{L}}$ selects a non-empty subset of the generated instantiation 
lemma $P[t_1, \ldots, t_p]$ (where each $t_i$ is to_term($d_{x_i}^{\mathcal{I}}$)) and the evaluation unfolding 
lemmas $L$ according to the lemma selection strategy $\mathcal{L}$. 
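
The explanation of a model value in terms of discriminators can be sketched in Python as a structural recursion. The tuple encoding of datatype values, the textual lemma format, and the names explain and unfolding_lemma are assumptions of this sketch; selector sort indices are omitted for brevity.

```python
def explain(d_name, value):
    """Discriminator conjuncts that pin the datatype term d_name to value."""
    conjuncts = [f"is_{value[0]}({d_name})"]
    for n, child in enumerate(value[1:], start=1):
        conjuncts += explain(f"sel_{n}({d_name})", child)
    return conjuncts


def unfolding_lemma(d_name, value, term_text):
    """Evaluation unfolding lemma: explanation implies e_d equals the term."""
    return f"{' and '.join(explain(d_name, value))} => e_{d_name} = {term_text}"


print(unfolding_lemma("d_x", ("plus", ("a",), ("b",)), "a + b"))
# is_plus(d_x) and is_a(sel_1(d_x)) and is_b(sel_2(d_x)) => e_d_x = a + b
```

The number of conjuncts equals the number of constructor applications in the model value, so explanations stay small for the small terms that enumeration tries first.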

We now discuss the correctness properties of our approach. In the following, 
we say a grammar $R$ for sort $\sigma$ is complete if, for all interpretations $\mathcal{I}$ and values 
$v$ of sort $\sigma$, it generates at least one term $t$ such that $t^{\mathcal{I}} = v$. Note that we only 
consider complete grammars in this paper. We say a lemma selection strategy $\mathcal{L}$ 
is fair with respect to a set of formulas $G$ if it returns a set of lemmas that contains at least 
one lemma inequivalent to each formula in $G$ whenever such a lemma exists. 


Theorem 1. Let $T$ be a theory with signature $\Sigma$, let $F$ be a set of universal 
formulas $\{Q_1, \ldots, Q_n\}$, and let $G_0$ be a set of quantifier-free formulas. If all grammars 
constructed by the calls to grammar$_{\mathcal{S}}$ in syqi are complete and the selection 
strategy $\mathcal{L}$ used for select_lemmas$_{\mathcal{L}}$ is fair, then the following statements hold: 


1. (Refutation soundness) If syqi($F$, $G_0$) returns unsat, $F \cup G_0$ is $T$-unsatisfiable. 

2. (Model soundness) If syqi($F$, $G_0$) returns sat, $F \cup G_0$ is $T$-satisfiable. 

3. (Progress) Let $G_i$ be the current state of the set of ground formulas $G$ after 
$i$ iterations of syqi (lines 6-11). Each iteration $i + 1$ adds at least one new 
formula to $G_i$, so that $G_{i+1} \setminus G_i \neq \emptyset$. 


Conceptually, the proof of refutational soundness relies on the fact that all 
lemmas added to G are entailed by the input or maintain equisatisfiability with 
respect to the input. The proof of model soundness relies on the fact that when 
G collectively entails the negation of (all) quantified formulas, then the current 
model Z for G must be a model for all quantified formulas. Procedure syqi is 
not terminating in general. However, the progress property guarantees that the 
algorithm does not get stuck in a single state and keeps making progress towards 
refining the set of possible models by ruling out at least one candidate model at 
each iteration of the procedure’s main loop. 


Proof. For brevity, we show these statements for the case of $n = 1$ and where $Q_1$ 
is $\forall \mathbf{x}.\, P[\mathbf{x}]$; the proof can be easily lifted to $n > 1$. When syqi($F$, $G_0$) terminates, 
the internal set $G$ is the union of: 


- The initial quantifier-free formula $G_0$, 

- The counterexample lemma $G_{cex}$ of the form $l \Rightarrow \neg P[e_{d_{\mathbf{x}}}]$ added on line 5, 

- A set of instantiations $G_{inst}$ of the form $P[\mathbf{t}]$, and 

- A set of evaluation lemmas $G_{ev}$ of the form $C[d] \Rightarrow e_d \approx t$. 
To show (1), assume that $\varphi$ (that is, $F \cup G_0$) is satisfied by some $\Sigma$-interpretation $\mathcal{J}$, where 
without loss of generality we assume that $l^{\mathcal{J}}$ is false. Let $\mathcal{I}$ be a $\Sigma \cup \Sigma_D$-interpretation 
that extends $\mathcal{J}$ such that for each evaluation variable $e_d$, the interpretation of 
$d$ in $\mathcal{I}$ is such that to_term($d^{\mathcal{I}}$)$^{\mathcal{I}} = e_d^{\mathcal{I}}$. Such a value exists since our grammars 
are complete by assumption. We show that $\mathcal{I}$ satisfies each formula $\psi$ in $G$. If 
$\psi \in G_0$, then this holds since $\mathcal{J}$ satisfies $\varphi$, and hence, by extension, $\mathcal{I}$ does 
as well. If $\psi \in G_{cex}$, then $\psi$ is satisfied by $\mathcal{I}$ since it interprets $l_i$ as false. If 
$\psi \in G_{inst}$ is an instantiation lemma of some $Q_i$, then it is satisfied by $\mathcal{I}$ since 
$\mathcal{J}$ also satisfies $Q_i$. If $\psi \in G_{ev}$ is an evaluation lemma, this is satisfied by our 
construction of $d^{\mathcal{I}}$. Thus, if $\varphi$ is $T$-satisfiable, then $G$ must be $(T \cup T_D)$-satisfiable. 
Thus, since syqi($F$, $G_0$) returns unsat when $G$ is $(T \cup T_D)$-unsatisfiable, this 
means that $F \cup G_0$ must be $T$-unsatisfiable as well. 

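To show (2), if syqi($F$, $G_0$) returns sat, then the set $G$ is satisfied by some 
$\Sigma \cup \Sigma_D$-interpretation $\mathcal{I}$ and $G \cup \{l\}$ is unsatisfiable. Let $\mathcal{J}$ be the $\Sigma$-interpretation 
that interprets all symbols in $\Sigma$ the same as in $\mathcal{I}$. Since $G \cup \{l\}$ is unsatisfiable, 
we have that $G_0 \cup G_{inst} \cup G_{ev} \cup \{\neg P[e_{d_{\mathbf{x}}}]\}$ is $(T \cup T_D)$-unsatisfiable. Since all 
$\Sigma$-interpretations can be lifted to a $\Sigma \cup \Sigma_D$-interpretation satisfying $G_{ev}$, it must 
also be the case that $G_0 \cup G_{inst} \cup \{\neg P[e_{d_{\mathbf{x}}}]\}$ is $T$-unsatisfiable. Hence, all models 
of $G_0 \cup G_{inst}$ must make $P[e_{d_{\mathbf{x}}}]$ true. Since $e_{d_{\mathbf{x}}}$ does not occur in $G_0 \cup G_{inst}$, 
this implies that all models of $G_0 \cup G_{inst}$ satisfy $\forall \mathbf{x}.\, P[\mathbf{x}]$. Since $G_0 \cup G_{inst} \subseteq G$ 
and $\mathcal{I}$ satisfies $G$, we have that $\mathcal{J}$ satisfies $\{\forall \mathbf{x}.\, P[\mathbf{x}]\} \cup G_0$. 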

To show (3), assume ad absurdum that $G$ is satisfied by a $(T \cup T_D)$-interpretation 
$\mathcal{I}$ where to_term($d_{\mathbf{x}}^{\mathcal{I}}$) $= \mathbf{t}$ and $Q_1$ is active in $\mathcal{I}$. Also assume that $G$ 
contains the evaluation unfolding lemmas for $d_{\mathbf{x}}^{\mathcal{I}}$ and the instantiation lemma 
$P[\mathbf{t}]$. Due to the former, we have that $e_{d_{\mathbf{x}}}^{\mathcal{I}} = \mathbf{t}^{\mathcal{I}}$. Since $Q_1$ is active in $\mathcal{I}$, $\mathcal{I}$ 
satisfies $\neg P[e_{d_{\mathbf{x}}}]$. However, $P[\mathbf{t}]$ is also satisfied by $\mathcal{I}$, a contradiction. Thus, at least 
one of the lemmas returned by select_lemmas$_{\mathcal{L}}$ for $Q_1$ must be inequivalent to 
the lemmas in $G$, due to our assumption that $\mathcal{L}$ is a fair selection strategy. 


3.1 Grammar Construction 


For quantifier instantiation, we focus on the theories of fixed-size bit-vectors, 
floating-point numbers, integers, and reals as defined by the SMT-LIB 2 
standard [4]. The signature of the theory of fixed-size bit-vectors includes a unique 
sort for each positive bit-vector width $n$, denoted here as $BV_{[n]}$. The signature 
of the theory of floating-point numbers includes a rounding-mode sort RM and 
a unique floating-point sort for each combination of positive exponent width $e$ 
and significand width $s$, denoted here as $FP_{[e,s]}$. The theories of Integers and 
Reals include the integer sort Int and the real sort Real, respectively. For each 
of these sorts we define a SyGuS grammar that includes the following operators 
and constants. 


    $R_{BV}$ : $\{\sim, -, \&, \mid, \oplus, +, \cdot, \div, \div_s, \mathrm{mod}, \mathrm{mod}_s, \ll, \gg, \gg_a, 0, 1, \mathrm{ones}, \mathrm{smin}, \mathrm{smax}\}$ 

    $R_{FP}$ : $\{-, \mathrm{abs}, \mathrm{rem}, \sqrt{\ }, \mathrm{rti}, +, \cdot, \div, \mathrm{fma}, \mathrm{NaN}, \pm\infty, \pm 0, \pm\mathrm{min}^s, \pm\mathrm{max}^s, \pm\mathrm{min}^n, \pm\mathrm{max}^n\}$ 

    $R_{RM}$ : $\{$RNA, RNE, RTN, RTP, RTZ$\}$    $R_{Int}$ : $\{+, -, 0, 1\}$    $R_{Real}$ : $\{+, -, \div, 0, 1\}$ 
Theory  Symbol                                        SMT-LIB Syntax                   Sort 
BV      $\sim$, $-$                                   bvnot, bvneg                     $BV_{[n]} \to BV_{[n]}$ 
BV      $\&$, $\mid$, $\oplus$                        bvand, bvor, bvxor               $BV_{[n]} \times BV_{[n]} \to BV_{[n]}$ 
BV      $\ll$, $\gg$, $\gg_a$                         bvshl, bvlshr, bvashr            $BV_{[n]} \times BV_{[n]} \to BV_{[n]}$ 
BV      $+$, $-$, $\cdot$                             bvadd, bvsub, bvmul              $BV_{[n]} \times BV_{[n]} \to BV_{[n]}$ 
BV      $\div$, $\div_s$, mod, mod$_s$                bvudiv, bvsdiv, bvurem, bvsrem   $BV_{[n]} \times BV_{[n]} \to BV_{[n]}$ 
FP      $-$, abs                                      fp.neg, fp.abs                   $FP_{[e,s]} \to FP_{[e,s]}$ 
FP      rem                                           fp.rem                           $FP_{[e,s]} \times FP_{[e,s]} \to FP_{[e,s]}$ 
FP      $\sqrt{\ }$, rti                              fp.sqrt, fp.roundToIntegral      $RM \times FP_{[e,s]} \to FP_{[e,s]}$ 
FP      $+$, $\cdot$, $\div$                          fp.add, fp.mul, fp.div           $RM \times FP_{[e,s]} \times FP_{[e,s]} \to FP_{[e,s]}$ 
FP      fma                                           fp.fma                           $RM \times FP_{[e,s]} \times FP_{[e,s]} \times FP_{[e,s]} \to FP_{[e,s]}$ 
Ints    $+$, $-$                                      +, -                             $Int \times Int \to Int$ 
Reals   $+$, $-$, $\div$                              +, -, /                          $Real \times Real \to Real$ 


Table 1. Set of operators considered in SyGuS grammars. 


The (non-constant) operators and their SMT-LIB names and types are listed in 
Table 1. Note that we further restrict the division operator $\div$ of sort Real to 
division by value, i.e., we do not allow division by an arbitrary term of sort Real. 
We also add a set of special values of the corresponding sort to each default 
grammar. We represent bit-vector values of sort $BV_{[n]}$ as bit-strings of length $n$, 
where the left-most bit is the most significant bit. For floating-point values of sort 
$FP_{[e,s]}$, we use bit-strings where the left-most bit indicates the sign, the following 
$e$ bits represent the exponent, and the remaining bits the significand. For the 
theory of fixed-size bit-vectors, we use $\mathrm{smax}_{[n]}$ or $\mathrm{smin}_{[n]}$ for the maximum or 
minimum signed value of width $n$, e.g., $\mathrm{smax}_{[4]} = 0111$ and $\mathrm{smin}_{[4]} = 1000$, and 
$\mathrm{ones}_{[n]}$ for the maximum unsigned value, e.g., $\mathrm{ones}_{[4]} = 1111$. For the theory of 
floating-point numbers, we use $\pm 0$ for positive and negative zero, $\pm\infty$ for positive 
and negative infinity, and NaN for not a number, e.g., $-0_{[3,5]} = 10000000$ and 
$+\infty_{[3,5]} = 01110000$. We further use $\pm\mathrm{min}^s$ for the positive and negative smallest 
subnormal, $\pm\mathrm{max}^s$ for the positive and negative largest subnormal, $\pm\mathrm{min}^n$ for the 
positive and negative smallest normal, and $\pm\mathrm{max}^n$ for the positive and negative 
largest normal, e.g., $-\mathrm{max}^s_{[3,5]} = 10001111$ and $+\mathrm{min}^n_{[3,5]} = 00010000$. In the 
definition of grammar $R_{FP}$ above, we use symbol $\pm$ to indicate that both the 
positive and negative variant of a special value is included in the grammar. 
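
The bit-string examples above can be recomputed with a short Python sketch. It assumes the SMT-LIB convention that the significand width $s$ counts the hidden bit, so that $s - 1$ significand bits are stored; the function names fp_bits, fp_special, and bv_special and the dictionary keys are hypothetical.

```python
def fp_bits(sign, exponent, significand, e, s):
    """Bit-string with 1 sign bit, e exponent bits, s - 1 stored significand bits."""
    stored = s - 1
    return f"{sign}{exponent:0{e}b}{significand:0{stored}b}"


def fp_special(e, s):
    """A few of the special floating-point values for sort FP[e,s]."""
    all_ones_exp = (1 << e) - 1         # exponent field of infinities
    all_ones_sig = (1 << (s - 1)) - 1   # stored significand with every bit set
    return {
        "-0": fp_bits(1, 0, 0, e, s),
        "+oo": fp_bits(0, all_ones_exp, 0, e, s),
        "-max_s": fp_bits(1, 0, all_ones_sig, e, s),  # negative largest subnormal
        "+min_n": fp_bits(0, 1, 0, e, s),             # positive smallest normal
    }


def bv_special(n):
    """Special bit-vector values of width n."""
    return {"smax": "0" + "1" * (n - 1), "smin": "1" + "0" * (n - 1), "ones": "1" * n}


print(fp_special(3, 5))
print(bv_special(4))
```

For $FP_{[3,5]}$ and $BV_{[4]}$ this reproduces the eight-bit and four-bit strings quoted in the text.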


We extend the above set of default grammars (grammar$_{\mathcal{S}}$ in Algorithm 1) 
with ground terms that occur in an input set $\{Q_1, \ldots, Q_n\} \cup G_0$ based on the sort 
of variable $x \in \mathbf{x}$ in $Q_i = \forall \mathbf{x}.\, P[\mathbf{x}]$ and a term selection strategy. This strategy 
is based on the following two factors. We consider three modes for the scope of 
ground terms: (1) ground terms that occur in quantified formula $Q_i$ (strategy 
in), (2) ground terms that occur in the set of ground formulas $G$ (strategy out), 
and (3) the union of (1) and (2) (strategy both). We consider three modes for 
the size of ground terms, defined as the number of subterms a term consists of: 
(a) terms of minimal size, i.e., constants that occur in a term (strategy min), (b) 
terms of maximal size (strategy max), and (c) the union of (a) and (b) (strategy 
both). For example, for a ground term $a + b \cdot c$, strategy min will select $a, b, c$, 
max will select $a + b \cdot c$, and both will select $a, b, c, a + b \cdot c$. Each of the scope and 
size modes may be combined, giving $3 \cdot 3 = 9$ possible term selection strategies. 
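
The size modes can be illustrated with the following Python sketch, which works on ground terms encoded as nested tuples. The encoding and the names constants and select_terms are assumptions of this sketch; the scope mode (in, out, both) is modelled simply by which list of maximal ground terms the caller passes in.

```python
def constants(term):
    """Minimal-size subterms: the constants occurring in a ground term."""
    if len(term) == 1:
        return {term}
    return set().union(*(constants(child) for child in term[1:]))


def select_terms(ground_terms, size):
    """Apply size mode 'min', 'max', or 'both' to a list of ground terms."""
    minimal = set().union(*(constants(t) for t in ground_terms))
    maximal = set(ground_terms)
    return {"min": minimal, "max": maximal, "both": minimal | maximal}[size]


# The ground term a + b * c from the example above.
term = ("plus", ("a",), ("times", ("b",), ("c",)))
print(sorted(select_terms([term], "min")))  # [('a',), ('b',), ('c',)]
```

Combining the three scope choices with the three size modes yields the nine strategies mentioned above.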


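Example 1. Let $Q = \forall x.\, x \cdot x \not\approx a \cdot a + b \cdot b + 2 \cdot a \cdot b$ where $x, a, b$ have integer type 
and suppose we run syqi($\{Q\}$, $\emptyset$). The algorithm first constructs the grammar 
grammar$_{\mathcal{S}}(x)$ for $x$, where we assume term selection strategy $\mathcal{S}$ with scope in 
and size min, which considers ground terms that occur in $Q$ and are of minimal 
size ($2$, $a$, and $b$). This grammar is encoded as the following datatype $\mathcal{Z}$: 


    $\mathcal{Z} = \mathsf{zero} \mid \mathsf{one} \mid \mathsf{plus}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{minus}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{two} \mid \mathsf{a} \mid \mathsf{b}$ 


The algorithm introduces a fresh datatype variable $d_x$ of type $\mathcal{Z}$ and a fresh 
variable $e_{d_x}$ of integer type, and adds $l \Rightarrow e_{d_x} \cdot e_{d_x} \approx a \cdot a + b \cdot b + 2 \cdot a \cdot b$ to 
the internal set $G$ of ground formulas, where $l$ is a fresh Boolean variable. In the 
first iteration of the loop, we have that $G$ (and $G \cup \{l\}$) are satisfiable. Hence, 
the algorithm calls select_lemmas$_{\mathcal{L}}$ on $Q$ and a model $\mathcal{I}$ for $G$; assume that 
$d_x^{\mathcal{I}} = \mathsf{zero}$ and $e_{d_x}^{\mathcal{I}} = a^{\mathcal{I}} = b^{\mathcal{I}} = 0$. Based on the lemma selection strategy, we 
may choose to add the instantiation lemma $0 \cdot 0 \not\approx a \cdot a + b \cdot b + 2 \cdot a \cdot b$, or the 
evaluation lemma $\mathrm{is}_{\mathsf{zero}}(d_x) \Rightarrow e_{d_x} \approx 0$, or both lemmas to $G$. Assuming both 
lemmas are added to $G$, the next iteration of the loop will consider a new model 
$\mathcal{I}'$ where $d_x^{\mathcal{I}'} \neq \mathsf{zero}$ and $e_{d_x}^{\mathcal{I}'} \neq 0$. The algorithm will continue finding models 
with new values for $d_x$, until it finds a model $\mathcal{I}''$ where $d_x^{\mathcal{I}''} = \mathsf{plus}(\mathsf{a}, \mathsf{b})$. At this 
point the instantiation lemma $(a + b) \cdot (a + b) \not\approx a \cdot a + b \cdot b + 2 \cdot a \cdot b$ will be added 
to $G$, which is equivalent to false, and syqi will terminate with unsat. 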
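
The flavor of Example 1 can be reproduced with a deliberately simplified Python sketch. It is not the procedure of Algorithm 1: the quantifier-free check is replaced by brute-force search over a small range of integer values for $a$ and $b$ (an assumption that makes any answer valid only relative to that range), the SyGuS solver is replaced by plain enumeration of grammar terms of nesting depth at most one, and evaluation unfolding lemmas are not modelled. All function names are hypothetical.

```python
from itertools import product

LEAVES = [("zero",), ("one",), ("two",), ("a",), ("b",)]


def candidates():
    """Grammar terms of nesting depth at most one, leaves first."""
    yield from LEAVES
    for op, left, right in product(("plus", "minus"), LEAVES, LEAVES):
        yield (op, left, right)


def evaluate(term, env):
    """Integer value of a term under an assignment env for a and b."""
    head = term[0]
    if head == "plus":
        return evaluate(term[1], env) + evaluate(term[2], env)
    if head == "minus":
        return evaluate(term[1], env) - evaluate(term[2], env)
    return {"zero": 0, "one": 1, "two": 2}.get(head, env.get(head))


def body(x, env):
    """P[x] for Q = forall x. x*x != a*a + b*b + 2*a*b."""
    a, b = env["a"], env["b"]
    return x * x != a * a + b * b + 2 * a * b


def syqi_toy(bound=3):
    values = range(-bound, bound + 1)
    envs = [{"a": a, "b": b} for a, b in product(values, repeat=2)]
    instances = []  # terms t whose instance P[t] has been added to G
    while True:
        # Stand-in for check(G): look for values of a, b satisfying all instances.
        model = next((m for m in envs
                      if all(body(evaluate(t, m), m) for t in instances)), None)
        if model is None:
            return "unsat", instances
        # Stand-in for the synthesis step: a term falsifying P in this model.
        cex = next((t for t in candidates()
                    if not body(evaluate(t, model), model)), None)
        if cex is None:
            return "unknown", instances
        instances.append(cex)


status, used = syqi_toy()
print(status, used)
```

With the default bound, the loop stops with unsat once the instance for plus(a, b) has been added, mirroring the final step of the example; the intermediate instances depend on the enumeration order chosen in this sketch.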


3.2 Implementation Details 


We implemented syntax-guided quantifier instantiation in the CVC4 [5] solver, 
which has support for a wide range of background theories, covering all those 
in the SMT-LIB standard library [2]. CVC4 is based on the CDCL(T) (for- 
merly DPLL(T)) framework [19]. This framework integrates a propositional SAT 
solver, which attempts to find a Boolean assignment that propositionally 
satisfies the input formula, with one or more specialized theory solvers, which monitor 
the assignments made by the SAT solver to theory literals and flag a conflict if 
the assignments are ever inconsistent in their theory. 

Our SyQI technique is implemented as a module of the subsolver of CVC4 
that handles quantified formulas. We leverage CVC4’s support for smart enumer- 
ative SyGuS as described in Reynolds et al. [22]. Specifically, the check method 
in line 7 in Algorithm 1 involves calling the (combination of) quantifier-free 
theory solvers, which includes an extension of the theory of datatypes described in 
the following. 


Symmetry Breaking for Smart Enumerative Synthesis. As described in previous 
work [22,24], CVC4 uses advanced techniques for symmetry breaking for the 
datatypes over which context-free grammars are embedded. The quantifier-free 
datatype theory solver in CVC4 is extended to issue symmetry blocking clauses 
based on reasoning about such datatypes, so that the models we generate for a 
datatype variable $d$ are such that to_term($d$) is unique with respect to rewriting. 
For example, the terms $a + b$ and $b + a$ are equivalent, and in CVC4, one will 
be rewritten to the other. Thus, we know that we only have to consider one 
variant, e.g., $a + b$. Hence, the extended datatypes solver may issue the blocking 
clause $\neg \mathrm{is}_{\mathsf{plus}}(d) \lor \neg \mathrm{is}_{\mathsf{b}}(\mathrm{sel}_{\mathcal{Z},1}(d)) \lor \neg \mathrm{is}_{\mathsf{a}}(\mathrm{sel}_{\mathcal{Z},2}(d))$, effectively stating that the 
term associated with $d$ should not be $b + a$. This technique is highly valuable 
for syntax-guided synthesis, since it reduces the set of terms considered in the 
search for candidate solutions. In the context of this work, these techniques are 
of great importance, since they guarantee that our algorithm does not consider 
multiple instantiations over tuples of pairwise equivalent terms. 
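
The effect of such blocking clauses can be imitated by a small Python sketch in which a toy rewriter stands in for CVC4's rewriter. The rewriter here only sorts the arguments of commutative constructors, which is an assumption made for illustration; the names COMMUTATIVE, canonical, and is_blocked are hypothetical.

```python
COMMUTATIVE = {"plus"}


def canonical(term):
    """Toy rewriter: sort the arguments of commutative constructors."""
    head, *children = term
    children = [canonical(child) for child in children]
    if head in COMMUTATIVE:
        children.sort()
    return (head, *children)


def is_blocked(term):
    """A datatype value is blocked if it is not its own canonical form."""
    return canonical(term) != term


print(is_blocked(("plus", ("b",), ("a",))))  # True: b + a rewrites to a + b
print(is_blocked(("plus", ("a",), ("b",))))  # False: already canonical
```

Filtering enumerated values through is_blocked keeps exactly one representative per class of terms that this toy rewriter identifies, which is the role the blocking clauses play inside the datatypes solver.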


Quantified Formulas within Boolean Structure and Nested Quantification. As 
mentioned earlier, while not shown in Algorithm 1, our approach uses standard 
techniques for handling general quantified formulas, in particular with 
quantifiers that occur below Boolean connectives. In the context of CDCL(T), for 
each quantified formula $Q_i$ of the form $\forall \mathbf{x}.\, P[\mathbf{x}]$, the propositional model of our 
Boolean structure may either assign $Q_i$ to true or false, or leave it unassigned. 
Quantified formulas that are assigned to false are Skolemized, i.e., a lemma of 
the form $\neg Q_i \Rightarrow \neg P[\mathbf{k}]$, where $\mathbf{k}$ are fresh constants, is returned to the SAT 
solver. Quantified formulas that are unassigned are ignored. Quantified 
formulas that are assigned to true are either active or inactive based on the value 
assigned to their counterexample literals. Those that are active are processed 
via select_lemmas$_{\mathcal{L}}$. In practice, instantiation lemmas are guarded so that 
$Q_i \Rightarrow P[\mathbf{t}]$ is returned to the SAT solver, meaning that the conclusion only 
holds when $Q_i$ is assigned to true. Furthermore, each $Q_i$ may have nested 
quantification, that is, the formula $P$ in the counterexample lemma $l_i \Rightarrow \neg P[e_{d_{\mathbf{x}}}]$ may 
contain quantified subformulas. Those quantified formulas are then processed by 
our full algorithm in the same way as quantified formulas from the input. 


4 Experiments 


We implemented our approach in the SMT solver CVC4 [5]. We provide here 
an extensive evaluation of the techniques and strategies described in Section 3. 
We first evaluate term and lemma selection strategies for grammar construction, 
and then compare the performance of our best configuration against Z3 [16], 
the only state-of-the-art SMT solver besides CVC4 that supports all the logics 
supported by our implementation. 

We performed all experiments on a cluster with Intel Xeon E5-2620 CPUs
at 2.1 GHz with 128 GB of memory. We used a time limit of 300 seconds
and a memory limit of 8 GB for each solver/benchmark pair, and counted
memory-outs as timeouts. We evaluate here all configurations on all quantified logics
in SMT-LIB [2] that do not contain uninterpreted functions (UF). As an ex- 
ception, we include the logic UFBV, since the benchmarks in this logic rely 
Strategy         Solved   Sat   Unsat     TO   MO   Uniq    Time[s]

Term Selection Strategies
both-max          12865   825   12040   2871   10      8   886137.3
both-both         12848   823   12025   2887   11     12   892219.8
both-min          12843   819   12024   2893   10     10   893808.7
in-both           12688   831   11857   3052    6      6   939886.7
in-min            12673   828   11845   3065    8      4   944167.2
in-max            12667   832   11835   3067   12      7   944952.3
out-both          12660   785   11875   3081    5      3   948301.4
out-min           12643   788   11855   3098    5      2   954925.1
out-max           12616   774   11842   3127    3      6   961683.9

Lemma Selection Strategies
interleave        12848   823   12025   2887   11     60   892272.2
priority-inst     12838   821   12017   2893   15     49   897454.3
priority-eval     12721   821   11900   3019    6     52   938443.4


Table 2. Selection strategies on considered logics (15,746 benchmarks).


almost entirely on BV reasoning. We generally exclude logics with UF since
for such logics counterexample-guided techniques, as in our approach, are not 
expected to be more effective than heuristic instantiation techniques such as 
E-matching, which we confirmed in a preliminary evaluation. Overall, we in- 
clude logics BV (bit-vectors), FP (floating-point arithmetic), LIA (linear integer 
arithmetic), LRA (linear real arithmetic), NIA (non-linear integer arithmetic), 
NRA (non-linear real arithmetic), and their combinations BVFP, BVFPLRA, 
FPLRA, and UFBV. In total, our benchmark set consists of 15,746 benchmarks. 


Term Selection for Grammar Construction. As a first experiment, we 
determine the best combination of scope-based and size-based ground term se- 
lection strategies for grammar construction as introduced in Section 3.1. We 
combine strategies based on scope with strategies based on term size into nine 
selection strategies: in-min, in-max, in-both, out-min, out-max, out-both, both-min, 
both-max, both-both. The results for our SyGuS instantiation approach with
these strategies enabled are shown in Table 2. Note that preliminary experiments
identified lemma selection strategy interleave as the best. Hence, we use strategy 
interleave as the lemma selection strategy for this experiment. 
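
The nine strategy names are simply the product of the three scope choices and the three size choices. A minimal sketch follows; the constants `SCOPES` and `SIZES` are names introduced here for illustration and are not taken from the CVC4 code base.

```python
from itertools import product

# Scope-based choices: where ground terms are collected from.
SCOPES = ("in", "out", "both")
# Size-based choices: which ground terms are kept.
SIZES = ("min", "max", "both")

# Every term selection strategy is named "<scope>-<size>".
strategies = [f"{scope}-{size}" for scope, size in product(SCOPES, SIZES)]
```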


Overall, using strategy both for the scope performs best. Furthermore, for
this strategy all three size-based strategies perform similarly well. For the re-
maining experiments, we use strategy both-both as the term selection strategy
for grammar construction, where both minimal and maximal ground terms are
selected from both the quantified formula $Q_i$ (containing the variable we con-
struct a grammar for) and the set of ground formulas $G$. Note that we choose the
more general strategy both-both over strategy both-max even though both-max
performs slightly better.


Lemma Selection. In our second experiment, we determine the best lemma se- 
lection strategy out of the three strategies priority-inst, priority-eval and interleave 
described in Section 3. The results are shown in Table 2. Note that we use the
previously determined best term selection strategy both-both in this experiment. 

The best overall strategy is interleave, indicating that it is beneficial to con- 
sider instantiation lemmas and evaluation unfolding lemmas in parallel. On the 
other hand, prioritizing evaluation lemmas over instantiation lemmas (priority- 
eval) performed significantly worse than the other two configurations. Since this 
strategy prioritizes evaluation lemmas, it has the advantage over other
configurations of delaying instantiations until we obtain an interpretation $\mathcal{I}$
where the interpretation of $e_{\bar{x}}$ is consistent with respect to $d_{\bar{x}}$, i.e.,
$e_{\bar{x}}^{\mathcal{I}} = \mathsf{to\_term}(d_{\bar{x}})^{\mathcal{I}}$. As a consequence, prioritizing evaluation lemmas puts more
effort into finding terms in instantiation that are guaranteed to refine the
current candidate model $\mathcal{I}$. However, we conclude from these results that it is
often effective to consider instantiations in an eager fashion, either in parallel
or even before considering evaluation lemmas. This is likely because
instantiation lemmas may often
refine the set of possible models even when G does not yet force our evaluation 
variables to have an interpretation that is consistent with their corresponding 
datatype values. Nevertheless, we found that evaluation lemmas are often neces- 
sary in practice for ensuring our procedure does not get stuck on a single model. 
When only instantiation lemmas are used, our procedure often terminates the 
loop with no new lemmas. This is to be expected, as such a strategy violates the 
requirements for the progress property of Theorem 1. 
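
One plausible reading of the three lemma selection strategies is sketched below. The function is a hypothetical stand-in, not the select_lemmas procedure of Algorithm 1: it assumes that a priority strategy falls back to the other kind of lemma only when the preferred kind is unavailable, and that interleave returns both kinds together.

```python
def select_lemmas(inst_lemmas, eval_lemmas, strategy):
    """Pick the lemmas to return in one round (illustrative assumption only)."""
    inst_lemmas, eval_lemmas = list(inst_lemmas), list(eval_lemmas)
    if strategy == "priority-inst":
        # Prefer instantiation lemmas; use evaluation lemmas only if none exist.
        return inst_lemmas or eval_lemmas
    if strategy == "priority-eval":
        # Prefer evaluation lemmas; use instantiation lemmas only if none exist.
        return eval_lemmas or inst_lemmas
    if strategy == "interleave":
        # Consider both kinds of lemmas in the same round.
        return inst_lemmas + eval_lemmas
    raise ValueError(f"unknown strategy: {strategy}")
```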

In the remaining experiment, we use strategy interleave as the lemma selection 
strategy since it performs slightly better than priority-inst. 


Comparison Against Other Techniques. Finally, we compare our SyGuS 
instantiation approach against other techniques implemented in CVC4, the state- 
of-the-art SMT solvers Z3 [16] (version 4.8.9) and Boolector [17] (version 3.2.1), 
and the superposition-based theorem prover Vampire [13] (version 4.5.1). Note 
that Boolector implements counterexample-guided model synthesis [20] but only 
supports the SMT-LIB logic BV, whereas Vampire supports LIA, LRA, NIA, 
and NRA. We consider the following four configurations of CVC4: ematch: 
with E-matching [15] enabled; cegqi: with CEGQI for linear arithmetic [25] and 
bit-vectors [18] enabled, falls back to value-based instantiation techniques for 
other theories; enum: with enumerative instantiation [21] enabled; syqi: with 
our SyGuS instantiation approach enabled. We use strategy both-both for term 
selection, and interleave for lemma selection. 

The results are summarized in Table 3. First, note that Z3 disagrees with the
four CVC4 configurations on 10 benchmarks in logic FP. This is due to
a known problem in Z3 related to operator rem, where it answers sat instead of
unsat. We do not count these 10 benchmarks as solved and give the number of
disagreements in parentheses, marked with *, in Table 3.

Overall, note that E-matching (ematch) performs very poorly on these 
benchmark sets. This is not surprising since it is designed with a focus on prob- 
lems with uninterpreted functions. To a lesser extent, enumerative instantiation 
(enum) also performs poorly, probably also due to the fact that it is not designed 
for inputs without uninterpreted functions. In detail, both this configuration and 
Logic (#)         Result      syqi   cegqi   ematch    enum           Z3   Boolector   Vampire

BV (5846)         sat          269     411      203     204          566         620         -
                  unsat       4752    5039     3846    4699         4934        4889         -
                  unsolved     825     396     1797     943          346         337         -

BVFP (224)        sat          113     110       26      29          174           -         -
                  unsat         14       4        4      14           11           -         -
                  unsolved      97     110      194     181           39           -         -

BVFPLRA (185)     sat          103      95       67      67          164           -         -
                  unsat          5       5        5       6            5           -         -
                  unsolved      77      85      113     112           16           -         -

FP (2484)         sat           34      28       23      23           47           -         -
                  unsat       2117    1899       83    1615         1923           -         -
                  unsolved     333     557     2378     846    504 (10)*           -         -

FPLRA (27)        sat           17      17       13      13           18           -         -
                  unsat          0       0        0       0            0           -         -
                  unsolved      10      10       14      14            9           -         -

LIA (607)         sat          188     199       19      19          189           -         5
                  unsat        319     357       46     171          295           -       310
                  unsolved     100      51      542     417          123           -       292

LRA (2419)        sat           79     593      461     461          740           -         0
                  unsat        955    1306     1018    1117         1454           -       871
                  unsolved    1385     520      940     841          225           -      1548

NIA (20)          sat           12      11        6       6           12           -         0
                  unsat          7       8        1       5            5           -         6
                  unsolved       1       1       13       9            3           -        14

NRA (3813)        sat            0       0        0       0            2           -         0
                  unsat       3781    3781     3703    3768         3806           -      3803
                  unsolved      32      32      110      45            5           -        10

UFBV (121)        sat            8       8        8       8           26           -         -
                  unsat         74      53       47      66           72           -         -
                  unsolved      39      60       66      47           23           -         -

Total (15746)     sat          823    1472      826     830         1938           -         -
                  unsat      12024   12452     8753   11461        12505           -         -
                  unsolved    2899    1822     6167    3455   1293 (10)*           -         -


Table 3. SyQI vs. other techniques, Z3, Boolector, and Vampire (15,746 benchmarks).


syqi are enumerative in nature. The former uses a selection strategy based on 
the evolving ground terms in the current context, whereas the latter uses a fixed 
grammar built from the initial set of terms. In a sense, syqi leverages the power 
of a grammar for discovering new terms, whereas enum adapts to what terms 
are generated by instantiations. Overall, syqi solves 556 more benchmarks than 
enumerative instantiation, justifying the need for a syntax-guided approach for 
instantiation for inputs that are rich in background theories. 

Our results show that syqi is remarkably competitive when compared to 
cegqi, which uses the best known theory-specific instantiation strategies. The 
performance of syntax-guided instantiation matches or exceeds counterexample- 
guided instantiation on logics BVFP, BVFPLRA, FP, FPLRA, NIA, NRA, and 
UFBV. In particular, for quantified floating-point arithmetic (FP), syqi
significantly outperforms cegqi, solving 224 more benchmarks. We attribute this
to the fact that cegqi only performs value-based instantiation,
whereas the use of grammars is effective in determining useful symbolic
terms to use in instantiations for this theory. Interestingly, syqi solves the only 
satisfiable benchmark in the NIA category that is unsolved by cegqi, mean- 
ing that in a portfolio setting with all available configurations, CVC4 solves 
all benchmarks in this category. On the other hand, counterexample-guided in- 
stantiation outperforms syqi on logics such as LIA, LRA, and BV, where well- 
established instantiation strategies exist. Syntax-guided techniques are especially 
ineffective for linear real arithmetic, since it is often important to construct spe- 
cific real constants based on solving sets of linear (in)equalities [25]. 

Comparing all configurations of CVC4 with Z3, Boolector, and Vampire, we 
see that in some logics like LIA and NIA, counterexample-guided instantiation in 
CVC4 outperforms Z3 and Vampire, whereas in other logics like NRA, UFBV, 
and many logics that combine BV, FP and LRA, Z3 performs best. For the 
logic BV, Boolector outperforms CVC4 and Z3; however, CVC4 solves the most 
unsatisfiable instances. The syqi configuration performs best on the floating- 
point benchmarks, where it solves 181 more than the closest competitor. When 
comparing the four CVC4 configurations in terms of uniquely solved instances, 
cegqi uniquely solves 660 instances, syqi 119 instances, enum 117 instances, 
and ematch not a single one. Between configurations cegqi and syqi, the former 
uniquely solves 1479 instances, and the latter 402 instances. 
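
The headline differences quoted in this section can be rechecked from the sat and unsat counts of Table 3. The sketch below hard-codes those counts; the dictionary names are introduced here for illustration, and the Z3 sat count for FP is read as 47, the value consistent with the row and column sums.

```python
# (sat, unsat) counts taken from Table 3.
TOTAL = {"syqi": (823, 12024), "enum": (830, 11461), "cegqi": (1472, 12452)}
FP = {"syqi": (34, 2117), "cegqi": (28, 1899), "z3": (47, 1923)}


def solved(counts):
    """A benchmark counts as solved if it is answered sat or unsat."""
    sat, unsat = counts
    return sat + unsat


gain_over_enum = solved(TOTAL["syqi"]) - solved(TOTAL["enum"])
gain_over_cegqi_fp = solved(FP["syqi"]) - solved(FP["cegqi"])
gain_over_z3_fp = solved(FP["syqi"]) - solved(FP["z3"])
# Uniquely solved instances between cegqi and syqi: 1479 versus 402.
cegqi_minus_syqi = solved(TOTAL["cegqi"]) - solved(TOTAL["syqi"])
```

The last value equals 1479 - 402, so the pairwise unique-solve counts agree with the totals.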

In summary, theory-specific approaches as implemented in CVC4, Z3, and 
Boolector outperform syqi in categories where instantiation strategies are highly 
mature, such as linear integer and real arithmetic, and fixed-width bit-vectors. 
Nevertheless, our evaluation demonstrates the versatility of the approach, es- 
pecially for benchmarks using quantified floating-point arithmetic or combined 
theories where no good approach to quantifier instantiation was known. 


5 Conclusion 


We have presented a syntax-guided approach for quantifier instantiation and im- 
plemented it in the SMT solver CVC4. Our experiments show that our approach 
is a viable alternative to theory-specific quantifier instantiation techniques and 
can be applied to a wide range of logics. In particular, for the theory of floating- 
point arithmetic, syntax-guided instantiation in CVC4 significantly outperforms 
the state of the art. In future work, we plan to tune our grammar construc- 
tion based on an analysis of which terms are more likely to appear in conflicts, 
which can potentially be done automatically. Another direction of future work 
is to provide an interface that would allow users to supply their own grammars 
for use in SyQI, similarly to the user-provided triggers for E-matching. We also 
plan to use our approach as a baseline for quantified logics in recent (and future) 
new theories. Currently, support in SMT solvers is highly limited, for instance, 
for quantified formulas involving the theory of strings and regular expressions. 
Syntax-guided instantiation can serve as a baseline for potential user applications 
that rely on quantified formulas in these theories. 
Abstract. Reasoning with quantifiers and theories is at the core of many
applications in program analysis and verification. Whilst the problem is 
undecidable in general and hard in practice, we have been making large 
pragmatic steps forward. Our previous work proposed an instantiation 
rule for theory reasoning that produced pragmatically useful instances. 
Whilst this led to an increase in performance, it had its limitations as 
the rule produces ground instances which (i) can be overly specific, thus 
not useful in proof search, and (ii) contribute to the already problematic 
search space explosion as many new instances are introduced. This paper 
begins by introducing a new rule, Gaussian Variable Elimination, that specifically addresses these two concerns as it
produces general solutions and it is a simplification rule, i.e. it replaces an 
existing clause by a ‘simpler’ one. Encouraged by initial success with this 
new rule, we performed an experiment to identify further common cases 
where the complex structure of theory terms blocked existing methods. 
This resulted in four further simplification rules for theory reasoning. The 
resulting extensions are implemented in the VAMPIRE theorem prover 
and evaluated on SMT-LIB, showing that the new extensions result in 
a considerable increase in the number of problems solved, including 90 
problems unsolved by state-of-the-art SMT solvers. 


1 Introduction 


Many applications of reasoning in program analysis and verification depend on 
reasoning with the first-order theory of arithmetic, often in combination with 
other theories and quantifiers. A common approach to this problem is via Satis- 
fiability Modulo Theory (SMT) solving, which has strong support for decidable 
theories but may struggle to scale in the presence of quantifiers. Conversely, 
superposition-based first-order solvers handle quantifiers naturally and have, re- 
cently, been extended to reason with theories P/3/5I6J91316[21]. Such solvers 
are based on a saturation loop and tend to suffer from search space explosion. 
This is compounded by the effective but explosive use of theory axioms, leading 
to the derivation of numerous inconsequential consequences of the theory. So far 
we have attempted to control this explosive behaviour [10,17] but now we aim
to eliminate some of it. This paper introduces a set of simplification rules for 
reasoning in the theory of (any combination of linear or non-linear real, rational, 
or integer) arithmetic, i.e. rules that make reasoning in arithmetic simpler. 
This work was motivated by our previous attempt [20] to find useful instances
of first-order clauses that would be otherwise difficult to find via reasoning with 
theory axioms. For example, when considering the two clauses 


$$r(7x) \qquad \neg r(6 + y) \lor p(y)$$


our previous work would apply resolution on $r(7x)$ and $\neg r(6 + y)$ using unification
with abstraction to produce the clause $7x \not\simeq 6 + y \lor p(y)$ and then apply theory
instantiation, utilising an SMT solver to find the substitution $\{x \mapsto 1, y \mapsto 1\}$,
producing the instance $p(1)$. This may or may not be useful to proof search
and, crucially, we need to keep performing inferences with the original clauses in
case it is not. In this case, we would prefer to instantiate with $\{y \mapsto 7x - 6\}$ to
produce $7x \not\simeq 6 + (7x - 6) \lor p(7x - 6)$, which can be reduced to $p(7x - 6)$. This
is a general solution (being logically equivalent) that is also simpler — in this 
case it has fewer variables than the original clause. Hence, we replace the clause 
by the more general result, aiding proof search and preventing the addition of 
unnecessary instances. 

The above was motivated by the observation that we would often see clauses 
of the form $kx \not\simeq t \lor C[x]$ (for numeral $k$, variable $x$, and term $t$) and expend
much effort using theory axioms to rewrite $kx \not\simeq t$ into $x \not\simeq \frac{t}{k}$. This led us to
conduct an experiment to identify other common cases where arithmetic clauses
could be simplified. An immediate observation is that, if $x$ ranges over the reals,
$p(7x - 6)$ can be instantiated with $\{x \mapsto \frac{y + 6}{7}\}$ to produce $p(y)$. Furthermore, in
the above example we no longer need to employ the expensive unification with
abstraction as we can instantiate $r(7x)$ with $\{x \mapsto \frac{x}{7}\}$ to produce $r(x)$ and then
resolve with $\neg r(6 + y) \lor p(y)$ to produce $p(y)$ directly.

Another observation was that a large amount of effort was expended by 
the theorem prover reordering sums and products to expose seemingly obvious 
structure. For example, taking (3t + x) + 2t and producing 5t + x requires three 
theory axioms and 12 rewriting steps. To combat this, we introduce an evaluation 
method that flattens sums and products, reorders and simplifies them, before 
reintroducing the necessary bracketed structure. A related common issue was the 
occurrence of terms that could easily be cancelled, such as in $4x + 3 < 4x + 10$,
again requiring significant rewriting effort that can be replaced by a special rule. 
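
The flattening and cancellation ideas can be sketched on linear terms as follows. This is only an illustration under simplifying assumptions, not VAMPIRE's evaluation procedure: terms are nested tuples, products are restricted to a numeral times a term, non-numeral leaves are treated as atoms, and the names `linear_form`, `show`, and `cancel` are invented for the example.

```python
from collections import Counter

CONST = "1"  # key under which the constant part of a sum is stored


def linear_form(term, scale=1, acc=None):
    """Flatten nested sums into a map from atoms to integer coefficients."""
    if acc is None:
        acc = Counter()
    if isinstance(term, int):
        acc[CONST] += scale * term
    elif isinstance(term, str):
        acc[term] += scale
    elif term[0] == "+":
        for arg in term[1:]:
            linear_form(arg, scale, acc)
    elif term[0] == "*" and isinstance(term[1], int):
        linear_form(term[2], scale * term[1], acc)
    else:
        raise ValueError(f"unsupported term: {term!r}")
    return acc


def show(form):
    """Print a flattened sum with atoms in a fixed (alphabetical) order."""
    parts = []
    for atom in sorted(a for a in form if a != CONST):
        k = form[atom]
        if k == 0:
            continue
        parts.append(atom if k == 1 else f"{k}{atom}")
    if form.get(CONST):
        parts.append(str(form[CONST]))
    return " + ".join(parts) or "0"


def cancel(lhs, rhs):
    """Drop summands that occur with the same coefficient on both sides."""
    left, right = linear_form(lhs), linear_form(rhs)
    for atom in list(left):
        if atom != CONST and left[atom] == right.get(atom):
            del left[atom], right[atom]
    return dict(left), dict(right)


# (3t + x) + 2t from the text, with t treated as an atom.
example = ("+", ("+", ("*", 3, "t"), "x"), ("*", 2, "t"))
flattened = show(linear_form(example))
# 4x + 3 < 4x + 10: both sides lose the shared summand 4x.
sides = cancel(("+", ("*", 4, "x"), 3), ("+", ("*", 4, "x"), 10))
```

The flattened form is computed in one pass, without applying associativity or commutativity axioms step by step.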

This paper does not present the exploratory experimentation described above 
but focusses instead on the fruits of this work. After introducing the necessary 
preliminaries (Sec. 2), we make the following contributions:


– A new Gaussian Variable Elimination rule (Sec. 3) that eliminates variables
if they can be described completely in terms of other variables.

– A set of Arithmetic Subterm Generalisation rules (Sec. 4) that replace clauses
with obvious generalisations, as in the above cases of replacing $p(7x - 6)$ with
$p(y)$ and $r(7x)$ with $r(x)$.

– A general approach to the evaluation of terms involving arithmetic (Sec. 5),
including a special rule to handle a surprisingly common corner case in-
volving unary minus.


166 G. Reger et al. 


– A rule for cancelling subterms, e.g. in $4x + 3 < 4x + 10$ (Sec. 6).


These rules are all implemented in the VAMPIRE theorem prover. Our 
experimental evaluation (Sec. 7) shows that the new rules significantly improve
the number of problems (from SMT-LIB) that VAMPIRE can solve. Our final 
experiment shows that the new VAMPIRE can solve 1,052 problems unsolved by 
VAMPIRE 4.5, 1,056 problems unsolved by CVC4, and 1,350 problems unsolved 
by Z3 — given their complementary nature, this equates to 90 problems unsolved 
by any of these state-of-the-art solvers. 


2 Preliminaries and Related Work 


First-Order Logic and Theories. We consider a many-sorted first-order logic 
with equality. A signature is a pair $\Sigma = (\Xi, \Omega)$ where $\Xi$ is a set of sorts and
$\Omega$ a set of predicate and function symbols with associated argument and return
sorts from $\Xi$. Terms are of the form $c$, $x$, or $f(t_1, \ldots, t_n)$ where $f$ is a function
symbol of arity $n \geq 1$, $t_1, \ldots, t_n$ are terms, $c$ is a zero arity function symbol (i.e.
a constant) and $x$ is a variable. We assume that all terms are well-sorted and
write $t : \sigma$ if term $t$ has sort $\sigma$. Atoms are of the form $p(t_1, \ldots, t_n)$, $q$, or $t_1 \simeq_s t_2$
where $p$ is a predicate symbol of arity $n$, $t_1, \ldots, t_n$ are terms, $q$ is a zero arity
predicate symbol and for each sort $s \in \Xi$, $\simeq_s$ is the equality symbol for the sort
$s$. We write simply $\simeq$ when $s$ is known from the context or irrelevant. A literal is
either an atom $A$, in which case we call it positive, or a negation of an atom $\neg A$,
in which case we call it negative. When $L$ is a negative literal $\neg A$ and we write
$\neg L$, we mean the positive literal $A$. For negative literals with binary predicates
$\neg(t_1 \diamond t_2)$ (like, e.g. equality), we sometimes write $t_1 \not\diamond t_2$.

A clause is a disjunction of literals $L_1 \lor \ldots \lor L_n$ for $n \geq 0$. We disregard
the order of literals and treat a clause as a multiset. When n = 0 we speak of 
the empty clause, which is always false. When n = 1 a clause is called a unit 
clause. Variables in clauses are considered to be universally quantified. Standard 
methods exist to transform an arbitrary first-order formula into clausal form 
(e.g. [15] and our recent work in [19]).

In the following we use expression to mean a term, an atom, a literal, or a 
clause. We write $E[t]_p$ to denote an expression $E$ containing a term $t$ at position
$p$ (a position is a unique point in an expression's syntax tree) and may then
write $E[s]_p$ to denote the same expression with $t$ replaced by term $s$ at $p$. We
will normally leave the position $p$ as implicit. A substitution is any $\theta$ of the form
$\{x_1 \mapsto t_1, \ldots, x_n \mapsto t_n\}$, where $n \geq 0$. $E\theta$ is the expression obtained from $E$
by the simultaneous replacement of each $x_i$ by $t_i$. An expression is ground if
it contains no variables. An instance of $E$ is any expression $E\theta$ and a ground
instance of $E$ is any instance of $E$ that is ground. A unifier of two terms, atoms
or literals $E_1$ and $E_2$ is a substitution $\theta$ such that $E_1\theta = E_2\theta$. It is known that
if two expressions have a unifier, then they have a so-called most general unifier. 
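
The notions of substitution and most general unifier can be made concrete with a small sketch of standard syntactic unification. The encoding is an assumption for this example only: variables are strings, and compound terms and constants are tuples whose first entry is the symbol, so the constant `a` is written `("a",)`. The function names are likewise illustrative.

```python
def is_var(term):
    """Variables are plain strings; everything else is a tuple."""
    return isinstance(term, str)


def apply(theta, term):
    """Apply substitution theta to a term, following bindings to the end."""
    if is_var(term):
        return apply(theta, theta[term]) if term in theta else term
    return (term[0],) + tuple(apply(theta, arg) for arg in term[1:])


def occurs(var, term):
    """Occurs check: does var appear anywhere inside term?"""
    if is_var(term):
        return term == var
    return any(occurs(var, arg) for arg in term[1:])


def mgu(s, t):
    """Return a most general unifier of s and t as a dict, or None."""
    theta = {}
    stack = [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = apply(theta, a), apply(theta, b)
        if a == b:
            continue
        if is_var(a):
            if occurs(a, b):
                return None  # binding would create a cyclic term
            theta[a] = b
        elif is_var(b):
            stack.append((b, a))
        elif a[0] == b[0] and len(a) == len(b):
            stack.extend(zip(a[1:], b[1:]))
        else:
            return None  # clash of function symbols or arities
    return theta


left = ("f", "x", ("a",))
right = ("f", ("b",), "y")
theta = mgu(left, right)
```

For these two terms the unifier maps `x` to `b` and `y` to `a`, so both sides become `f(b, a)`.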

We assume a standard notion of a (first-order, many-sorted) interpretation $\mathcal{I}$,
which assigns a non-empty domain $\mathcal{I}_s$ to every sort $s \in \Xi$, and maps every func-
tion symbol $f$ to a function $\mathcal{I}_f$ and every predicate symbol $p$ to a relation $\mathcal{I}_p$ on
these domains so that the mapping respects sorts. We call $\mathcal{I}_f$ the interpretation
of $f$ in $\mathcal{I}$, and similarly for $\mathcal{I}_p$ and $\mathcal{I}_s$. Interpretations are also sometimes called
first-order structures. A sentence is a closed formula, i.e. with no free variables. 
We use the standard notions of validity and satisfiability of sentences in such 
interpretations. An interpretation is a model for a set of clauses if (the universal 
closure of) each of these clauses is true in the interpretation. 

A theory $\mathcal{T}$ is identified by a class of interpretations. A sentence is satisfiable
in $\mathcal{T}$ if it is true in at least one of these interpretations and valid if it is true in
all of them. A function (or predicate) symbol $f$ is called uninterpreted in $\mathcal{T}$, if
for every interpretation $\mathcal{I}$ of $\mathcal{T}$ and every interpretation $\mathcal{I}'$ which agrees with $\mathcal{I}$
on all symbols apart from $f$, $\mathcal{I}'$ is also an interpretation of $\mathcal{T}$. A theory is called
complete if, for every sentence $F$ of this theory, either $F$ or $\neg F$ is valid in this
theory. Evidently, every theory of a single interpretation is complete. We can 
define satisfiability and validity of arbitrary formulas in an interpretation in a 
standard way by treating free variables as new uninterpreted constants. 

The theories we will deal with are the theories of integer, rational, and real
arithmetic with uninterpreted functions, denoted by $\mathcal{T}_{\mathbb{Z}}$, $\mathcal{T}_{\mathbb{Q}}$, and $\mathcal{T}_{\mathbb{R}}$, which fix the
interpretation of a distinguished sort $\sigma_{\mathbb{Z}}$, $\sigma_{\mathbb{Q}}$, and $\sigma_{\mathbb{R}}$ to the set of mathematical
integers $\mathbb{Z}$, rationals $\mathbb{Q}$, and reals $\mathbb{R}$ respectively, and assign the usual meanings to
the function and predicate symbols $\{+, -, <, \leq, \cdot\}$. By $k$, we denote the numeral
interpreted as $k$ in any of these theories. We consider signatures over these
theories to additionally contain uninterpreted functions and predicates; hence,
in contrast to the case without uninterpreted functions, for none of these theories
is there a sound and complete proof system (see e.g. [13]).

Unless stated differently, we use the symbols x,y,z for variables, s,t,u for 
terms, C, D for clauses, p, q,r for predicate symbols, f,g,h for function symbols, 
and o for substitutions, and sorts, with sometimes suffixes being added. 


Term Orderings. A simplification ordering (see, e.g. [8]) on terms is an ordering 
that is well-founded, monotonic, stable under substitutions and has the subterm 
property. Such an ordering captures a notion of simplicity, i.e. $t_1 \prec t_2$ implies
that $t_1$ is in some way simpler than $t_2$. VAMPIRE uses the Knuth-Bendix or-
dering [12], which is parametrized by a total precedence ordering $\ll$ on function and
predicate symbols. This is total on ground terms and partial on non-ground
ones, leading to the possibility of incomparable terms, e.g. $f(x, a)$ and $f(b, y)$. A
simplification ordering $\prec$ on terms can be extended to a simplification ordering
on literals and clauses, using a multiset extension of orderings. For simplicity,
we will use $\prec$ to refer to the term ordering and its lifting. Whenever $E_1 \prec E_2$
($E_2 \prec E_1$) we say that $E_1$ is smaller (bigger) than $E_2$. An equality literal $t \simeq s$
is oriented if $t \prec s$ or $s \prec t$.


Saturation-Based Proof Search. We introduce our new rules within the context of 
saturation-based proof search. The general idea in saturation is to maintain two 
sets of Active and Passive clauses. A saturation-loop then selects a clause C from 
Passive, places C in Active, applies generating inferences between C and clauses 
in Active, and finally places newly derived clauses in Passive after applying some 
retention tests. The retention tests involve checking whether the new clause is
itself redundant (i.e. a tautology) or redundant with respect to existing clauses
(implied by a set of smaller clauses in $\mathit{Active} \cup \mathit{Passive}$). Rules that remove the
parent clause immediately from the search space without performing a retention 
test are called immediate simplification rules. Whenever there are applicable 
immediate simplification rules, the first one wrt. some fixed ordering is chosen 
to be applied to the selected clause instead of applying any other rule. The rules 
introduced in this paper are all introduced as immediate simplification rules. 
However, as mentioned later, not all of them strictly obey the requirement that 
the result is smaller. Normally this would have implications on the completeness 
of the approach but we lose completeness when we start reasoning with theories. 
This leads us to a trade-off between the potential loss of some proofs by missing 
some inferences, and the potential gain via simplifying proof search. Our later 
experimental results show that forgoing completeness is of pragmatic interest. 
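
The general shape of a saturation loop can be sketched on propositional clauses. This is a deliberately reduced illustration, not VAMPIRE's loop: clauses are sets of non-zero integers (a negative number is a negated atom), binary resolution is the only generating inference, the retention test only discards tautologies and duplicates, and there are no simplification rules. All names are invented for the example.

```python
def resolvents(c, d):
    """All binary resolvents of two propositional clauses."""
    return [(c - {lit}) | (d - {-lit}) for lit in c if -lit in d]


def is_tautology(clause):
    """A clause containing a literal and its negation is always true."""
    return any(-lit in clause for lit in clause)


def saturate(clauses, max_steps=10_000):
    """Given-clause loop over Active and Passive clause sets."""
    passive = [frozenset(c) for c in clauses]
    active = []
    for _ in range(max_steps):
        if not passive:
            return "saturated"
        given = min(passive, key=len)  # clause selection: smallest first
        passive.remove(given)
        if not given:
            return "unsat"  # the empty clause was part of the input
        active.append(given)
        for other in active:
            for new in resolvents(given, other):
                if not new:
                    return "unsat"  # derived the empty clause
                # Retention tests: drop tautologies and known clauses.
                if is_tautology(new) or new in active or new in passive:
                    continue
                passive.append(new)
    return "unknown"


refuted = saturate([{1}, {-1, 2}, {-2}])
consistent = saturate([{1, 2}, {-1, 2}])
```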


Superposition Calculus. VAMPIRE works with the superposition and resolution 
calculus (see our previous work for a description). The calculus itself is 
not of direct interest to this work. We do, however, draw attention to two rules. 
Firstly, the Equality Resolution rule 


$$\frac{s \not\simeq t \lor C}{C\theta} \qquad \text{where } \theta \text{ is a most general unifier of } s \text{ and } t$$


is a starting point for both our previous theory instantiation work and the Gaus-
sian Variable Elimination rule introduced later (Sec. 3). Secondly, we draw at-
tention to the Demodulation (or rewriting by unit equalities) rule


$$\frac{l \simeq r \qquad L[t] \lor C}{L[r\theta] \lor C}$$


where $l\theta = t$, $r\theta \prec l\theta$, and $(l \simeq r)\theta \prec L[t] \lor C$. This is of interest as later we will
need to take special care of the last side-condition when evaluating terms. 


Theory Reasoning. To perform theory reasoning within this context it is common 
to do two things. Firstly, to evaluate new clauses to put them in a common form 
(e.g. rewrite all inequalities in terms of <) and evaluate ground theory terms and 
literals (e.g. $1 + 2$ becomes $3$ and $1 < 2$ becomes true). More complex evaluation
is possible and is the subject of this work (see Section 5). Secondly, relevant 
theory axioms can be added to the initial search space. For example, if the input 
clauses use the $+$ symbol one can add the axioms $x + y \simeq y + x$ and $x + 0 \simeq x$,
among others. 

In addition to these basic methods, VAMPIRE also employs a number of other 
techniques. AVATAR modulo theories [16] uses an SMT solver within the con-
text of clause splitting to ensure that the ground part of any chosen clause splits
is theory-consistent. The previously mentioned unification with abstraction and
theory instantiation rules support lazy unification modulo theories and prag- 
matic instantiation. Theory axiom usage can be controlled by the set of support 


Making Theory Reasoning Simpler 169 


strategy [17] or layered clause selection [10]. Both approaches de-prioritise
reasoning with theory axioms.


3 Gaussian variable elimination 


Recall the example $7x \not\simeq 6 + y \lor p(y)$ from the Introduction (Sec. 1) where we
want to identify the substitution $\{y \mapsto 7x - 6\}$ to produce the simpler instance
$p(7x - 6)$. Our general approach is to rewrite $7x \not\simeq 6 + y$ in terms of $y$ and then
apply the standard Equality Resolution rule introduced in Sec. 2. This gives us
the straightforward rule:


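$$\frac{s \not\simeq t \lor C[x]}{C[u]}\ \mathsf{gve}$$


where $x : \sigma_{\mathbb{Z}}$, $x : \sigma_{\mathbb{Q}}$, or $x : \sigma_{\mathbb{R}}$, $(s, t) \Rightarrow^{*}_{\mathsf{gve}} (x, u)$ or $(t, s) \Rightarrow^{*}_{\mathsf{gve}} (x, u)$, and $x$ is
not a subterm of $u$. The relation $\Rightarrow^{*}_{\mathsf{gve}}$ is the reflexive and transitive closure of
the relation $\Rightarrow_{\mathsf{gve}}$, which can be defined as follows.


$$\begin{array}{lcll}
(s + t, u) & \Rightarrow_{\mathsf{gve}} & (s, u + (-t)) & \\
(s + t, u) & \Rightarrow_{\mathsf{gve}} & (t, u + (-s)) & \\
(-s, t) & \Rightarrow_{\mathsf{gve}} & (s, -t) & \\
(s \cdot t, u) & \Rightarrow_{\mathsf{gve}} & (s, u / t) & \text{if } t \neq 0 \text{, and } t : \sigma_{\mathbb{Q}} \text{ or } t : \sigma_{\mathbb{R}} \\
(s \cdot t, u) & \Rightarrow_{\mathsf{gve}} & (t, u / s) & \text{if } s \neq 0 \text{, and } s : \sigma_{\mathbb{Q}} \text{ or } s : \sigma_{\mathbb{R}} \\
(s / t, u) & \Rightarrow_{\mathsf{gve}} & (s, u \cdot t) & \text{if } t \neq 0 \text{, and } t : \sigma_{\mathbb{Q}} \text{ or } t : \sigma_{\mathbb{R}}
\end{array}$$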


It should be noted that $\Rightarrow_{\mathsf{gve}}$ is not normalising. The pair $(s_1 + s_2, t)$ can,
for example, be rewritten to $(s_1, t - s_2)$, as well as to $(s_2, t - s_1)$. But due to the
fact that there is at most a linear number of such rewritings, we can enumerate
all of them and choose the first $(x, t)$, such that $x$ is not a subterm of $t$. Further
choice comes from the fact that we can either rewrite based on $(l, r)$, or based
on $(r, l)$. Looking at our example, we could rewrite


$$(6 + y, 7x) \Rightarrow_{\mathsf{gve}} (y, 7x - 6)$$


but also

$$(7x, 6 + y) \Rightarrow_{\mathsf{gve}} (x, (6 + y) / 7)$$


if x is not of integer sort, leaving us with a choice. Another source of choice 
comes from the fact that our premise can contain multiple negative equalities. 
Any of those could potentially be used to rewrite the rest of the clause. 

Since application of the rule yields a logically equivalent conclusion with
fewer literals and fewer distinct variables, we make an arbitrary choice. For
the same reason, we implement this as a simplification rule (thus removing the 
premise from the search space) even though the conclusion will often be incom- 
parable to (not smaller than) the premise. 
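
The enumeration of rewritings described above can be sketched as a small search. The sketch makes several simplifying assumptions that are not part of the rule as stated: terms are nested tuples with operators `"+"`, `"neg"`, `"*"`, and `"/"`, variables are strings, a single flag stands in for the sort of all terms, and the side condition that a divisor is non-zero is approximated by requiring it to be a non-zero numeral. The names `gve_steps` and `solve_for_variable` are hypothetical.

```python
def is_subterm(x, term):
    """Does x occur in term (including term itself)?"""
    if term == x:
        return True
    return isinstance(term, tuple) and any(is_subterm(x, a) for a in term[1:])


def gve_steps(lhs, rhs, integer_sort):
    """One-step successors of the pair (lhs, rhs) under the rewrite relation."""
    if not isinstance(lhs, tuple):
        return
    op, args = lhs[0], lhs[1:]
    if op == "+":
        s, t = args
        yield s, ("+", rhs, ("neg", t))
        yield t, ("+", rhs, ("neg", s))
    elif op == "neg":
        (s,) = args
        yield s, ("neg", rhs)
    elif op == "*" and not integer_sort:
        s, t = args
        if isinstance(t, int) and t != 0:
            yield s, ("/", rhs, t)
        if isinstance(s, int) and s != 0:
            yield t, ("/", rhs, s)
    elif op == "/" and not integer_sort:
        s, t = args
        if isinstance(t, int) and t != 0:
            yield s, ("*", rhs, t)


def solve_for_variable(s, t, integer_sort=False):
    """First reachable pair (x, u) with x a variable that is not a subterm of u."""
    todo = [(s, t), (t, s)]
    while todo:
        lhs, rhs = todo.pop(0)
        if isinstance(lhs, str) and not is_subterm(lhs, rhs):
            return lhs, rhs
        # Each step strictly shrinks lhs, so the search terminates.
        todo.extend(gve_steps(lhs, rhs, integer_sort))
    return None


over_reals = solve_for_variable(("*", 7, "x"), ("+", 6, "y"))
over_integers = solve_for_variable(("*", 7, "x"), ("+", 6, "y"), integer_sort=True)
```

On the running example this returns the binding for `x` when division is allowed and the binding for `y` over the integers, matching the two rewritings shown above.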

To further demonstrate this rule we consider the additional example

$$\begin{array}{ll}
x + y \not\simeq 36 \lor x + 3y \not\simeq 90 \lor p(x, y) & \\
(36 - y) + 3y \not\simeq 90 \lor p(36 - y, y) & (\mathsf{gve}) \\
36 + 2y \not\simeq 90 \lor p(36 - y, y) & (\mathsf{eval}) \\
p(36 - (90 - 36)/2,\ (90 - 36)/2) & (\mathsf{gve}) \\
p(9, 27) & (\mathsf{eval})
\end{array}$$


which highlights the need to interleave evaluation between successive Gaussian
elimination steps; we discuss our evaluation strategy below.


$$p(7xxxy - 6) \;\xrightarrow{\mathsf{asg}_{\mathsf{var}}}\; p(7xxx - 6) \;\xrightarrow{\mathsf{asg}_{\mathsf{pow}}}\; p(7x - 6) \;\xrightarrow{\mathsf{asg}_{\mathsf{num}}}\; p(x - 6) \;\xrightarrow{\mathsf{asg}_{+}}\; p(x)$$

Figure 1. Illustration of the 4 generalization rules, in the theory of Reals.


4 Arithmetic subterm generalization 


Taking a closer look at the choice for our example from the previous section, we see that we could have instantiated the premise $y + 6 \not\approx 7x \lor p(y)$ either with $\{y \mapsto 7x - 6\}$ to get $p(7x - 6)$, or with $\{x \mapsto (6 + y) / 7\}$ to obtain $p(y)$ (again, assuming that $x$ is not of integer sort). Both clauses are logically equivalent in $\mathcal{T}_{\mathbb{Q}}$ and $\mathcal{T}_{\mathbb{R}}$: the former is an instance of the latter, and the former implies the latter, as we can apply the substitution $\{x \mapsto (y + 6) / 7\}$ to it and simplify the result to the latter clause. Obviously this kind of reasoning can be applied to any linear subterm $k \cdot x + d$ where $k \neq 0$.
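
As a small sanity check of this argument, the following Python sketch verifies with exact rational arithmetic that substituting $x \mapsto (y - d)/k$ into $k \cdot x + d$ gives back $y$. The helper name `generalise_linear` and the sample values are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction

def generalise_linear(k, d):
    """Return the substitution x -> (y - d) / k as a function of y (requires k != 0)."""
    if k == 0:
        raise ValueError("k must be non-zero")
    return lambda y: (Fraction(y) - d) / k

# The running example 7x - 6, i.e. k = 7 and d = -6, so x -> (y + 6) / 7.
k, d = 7, -6
substitution = generalise_linear(k, d)
for y in [Fraction(-3), Fraction(0), Fraction(5, 2), Fraction(41)]:
    x = substitution(y)
    assert k * x + d == y   # the linear subterm collapses to the fresh variable
```

The division by $k$ is the reason the argument needs a rational or real sort: over the integers, $(y + 6)/7$ is in general not a term of the right sort.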

Splitting this idea into multiple rules lets us take these generalizations further. We therefore propose 4 rules for arithmetic subterm generalization, which are illustrated in a single example in Figure 1.

Since we do not want the applicability of our generalization rules to depend on associativity and commutativity (AC), we formulate them modulo AC. For this purpose we introduce the following notation. We use $C[t]_{AC}$ to denote a clause that contains the subterm $t$ modulo AC. Further, we use $C[t']_{AC}$ to denote the same clause with all occurrences of $t$ modulo AC replaced by $t'$.


Addition Generalization

$$\frac{C[x + t_1 + \ldots + t_n]_{AC}}{C[x]_{AC}}\ \mathsf{asg}^{+}$$

where

– $x : \sigma$ for some $\sigma \in \{\sigma_{\mathbb{Z}}, \sigma_{\mathbb{Q}}, \sigma_{\mathbb{R}}\}$
– all occurrences of $x$ are in the subterm $x + t_1 + \ldots + t_n$ (modulo AC)
– $x$ is not a subterm of $t_i$

The first rule deals with the case where a clause contains a sum with a variable as summand. Such a sum can be generalized by applying the substitution $\{x \mapsto x - t_1 - \ldots - t_n\}$ and simplifying the result.


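Numeral Multiplication Generalization

$$\frac{C[k \cdot x \cdot t_1 \cdot \ldots \cdot t_n]_{AC}}{C[x \cdot t_1 \cdot \ldots \cdot t_n]_{AC}}\ \mathsf{asg}^{*}_{\mathsf{num}}$$

where

– $x : \sigma$ for some $\sigma \in \{\sigma_{\mathbb{Q}}, \sigma_{\mathbb{R}}\}$
– all occurrences of $x$ are in the term $k \cdot x \cdot t_1 \cdot \ldots \cdot t_n$ (modulo AC)
– $x$ is not a subterm of $t_i$

In the second rule we generalize a product that contains one variable that occurs only once in this product. Its soundness is justified by the substitution $\{x \mapsto \frac{x}{k}\}$.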


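Variable Multiplication Generalization

$$\frac{C[x \cdot x_1 \cdot \ldots \cdot x_n]_{AC}}{C[x]_{AC}}\ \mathsf{asg}^{*}_{\mathsf{var}}$$

where

– $x : \sigma$ for some $\sigma \in \{\sigma_{\mathbb{Z}}, \sigma_{\mathbb{Q}}, \sigma_{\mathbb{R}}\}$
– all occurrences of $x, x_i$ are in the term $x \cdot x_1 \cdot \ldots \cdot x_n$ (modulo AC)
– $x \neq x_i$

In this rule we generalize subterms that are products of variables, containing redundant variables. The rule is sound since we can replace $x_i$ by 1.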


Variable Power Generalization

$$\frac{C[x^n]_{AC}}{C[x^k]_{AC}}\ \mathsf{asg}^{\mathsf{pow}}_{\mathsf{var}}$$

where

– $x : \sigma_{\mathbb{R}}$
– $x^n$ is an abbreviation for $x \cdot x \cdot \ldots \cdot x$
– $k = 1$ if $n$ is odd, and $k = 2$ if $n$ is even
– all occurrences of $x$ are in the term $x^n$ (modulo AC)

The last rule lets us generalize away redundant powers of variables. Its soundness is guaranteed by the fact that, for Real numbers, the co-domains of $x^n$ and $x^k$ are the same.

All of the above rules produce a result that is smaller with respect to any simplification ordering due to the removal of terms, justifying their implementation as immediate simplifications.


5 Evaluation


As mentioned above, reasoning with arithmetic often requires us to be able to evaluate terms. Evaluations such as $3 + 3 \Rightarrow 6$ and $f(x) + 0 \Rightarrow f(x)$ are straightforward, but we also want to support evaluations such as $(3t + x) + 2t \Rightarrow 5t + x$ for variable $x$ and arbitrary term $t$. We introduce a new method for this (replacing a previous ad-hoc method implemented in VAMPIRE). The general idea is to first rewrite terms into a special normal form, apply simplifying steps that preserve this form, and then denormalise to obtain standard terms again. We describe the three steps in detail below.


Normalization. This step removes the need to take care of reordering and bracketing of terms. Our general normal form is as follows

$$c_1 \cdot (t_{1,1} \cdot \ldots \cdot t_{1,k_1}) + \ldots + c_n \cdot (t_{n,1} \cdot \ldots \cdot t_{n,k_n})$$

where $t_{i,j} <_1 t_{i,j+1}$ and $(t_{i,1} \cdot \ldots \cdot t_{i,k_i}) <_2 (t_{i+1,1} \cdot \ldots \cdot t_{i+1,k_{i+1}})$. To get to this normal form we rewrite $-t$ as $-1 \cdot t$, rewrite $t / c$ as $t \cdot \frac{1}{c}$, rewrite $t$ as $1 \cdot t$ where necessary, and sort with respect to $<_1$ and $<_2$. Both relations $<_1$ and $<_2$ need to be strict total orderings, on terms and on $<_1$-sorted lists of terms respectively. VAMPIRE uses so-called aggressive sharing for terms, meaning that for each distinct term there is at most one instance present in memory, and copies are made by copying the term's id. Hence we can define $<_1$ as comparing the ids of two terms. We use the same approach for $<_2$.


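Simplification. Once in normal form, terms can be simplified by joining coefficients for identical terms and removing terms multiplied by zero. This can be given as follows:

$$
\begin{array}{rcl}
c \cdot t \cdot \ldots \cdot d \cdot \ldots \cdot u & \Rightarrow_{\mathsf{eval}} & (c \cdot d) \cdot t \cdot \ldots \cdot u \\
s + \ldots + c \cdot t + d \cdot t + \ldots + u & \Rightarrow_{\mathsf{eval}} & s + \ldots + (c + d) \cdot t + \ldots + u \\
s + \ldots + 0 \cdot t + \ldots + u & \Rightarrow_{\mathsf{eval}} & s + \ldots + u
\end{array}
$$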


If we would generate an empty sum by removing an addition we will simplify to 0 
instead. All of these steps can be implemented in linear time and in a bottom up 
manner, since we firstly can rely on the terms being sorted by the non-numeral 
parts of their summands, and secondly on a numeral part of a product being on 
a fixed position. 
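
The normalise, simplify, and denormalise chain can be illustrated with a short Python sketch. It rests on assumptions that differ from VAMPIRE: terms are nested tuples with string atoms, monomials are sorted by atom name instead of by shared term ids, products are distributed over sums, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def normalise(t):
    """Map a term to {monomial: coefficient}; a monomial is a sorted tuple of atom names."""
    if isinstance(t, (int, Fraction)):
        return {(): Fraction(t)} if t != 0 else {}
    if isinstance(t, str):
        return {(t,): Fraction(1)}                    # t is read as 1 * t
    op = t[0]
    if op == '-':                                      # -t is read as -1 * t
        return {m: -c for m, c in normalise(t[1]).items()}
    if op == '/':                                      # t / c is read as t * (1/c)
        return {m: c / Fraction(t[2]) for m, c in normalise(t[1]).items()}
    left, right = normalise(t[1]), normalise(t[2])
    out = defaultdict(Fraction)
    if op == '+':
        for part in (left, right):
            for m, c in part.items():
                out[m] += c                            # join coefficients of identical monomials
    elif op == '*':
        for m1, c1 in left.items():
            for m2, c2 in right.items():
                out[tuple(sorted(m1 + m2))] += c1 * c2
    else:
        raise ValueError(f"unsupported operator: {op!r}")
    return {m: c for m, c in out.items() if c != 0}   # drop summands multiplied by zero

def denormalise(nf):
    """Render a normal form as a sum, omitting coefficients equal to 1; the empty sum is 0."""
    if not nf:
        return "0"
    parts = []
    for m, c in sorted(nf.items()):
        if not m:
            parts.append(str(c))
        elif c == 1:
            parts.append("*".join(m))
        else:
            parts.append(f"{c}*{'*'.join(m)}")
    return " + ".join(parts)

# (3t + x) + 2t, the example from the text
term = ('+', ('+', ('*', 3, 't'), 'x'), ('*', 2, 't'))
print(denormalise(normalise(term)))
```

Using a dictionary keyed by the sorted monomial plays the role of the two sorting relations: identical monomials meet in the same slot regardless of how the input was bracketed or ordered.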


Denormalisation. Finally, as the normal form contains redundant information (such as $1 \cdot t + \ldots$ instead of $t + \ldots$) we need to denormalise as follows:

$$
\begin{array}{rcl}
-1 \cdot (t_1 \cdot \ldots \cdot t_n) & \Rightarrow & (t_1 \cdot (\ldots (t_{n-1} \cdot (-t_n)) \ldots)) \\
1 \cdot (t_1 \cdot \ldots \cdot t_n) & \Rightarrow & (t_1 \cdot (\ldots \cdot t_n))
\end{array}
$$

We define the rule eval to be the chain of normalising, simplifying and denormalising a clause in a bottom-up manner, which is only applied if the simplification step is successful for some subterm. The reason for not always applying the rule is to prevent arbitrary reordering of sums and products, which in many cases leads to conclusions that are bigger than the premise. This can have significant consequences beyond perturbing proof search. Consider the following scenario involving the Demodulation rule (see Sec. B).


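$$
\begin{array}{cl}
x + y \approx y + x \qquad k \approx a + (b + c) & \\ \hline
k \approx a + (c + b) & \mathsf{demodulation} \\ \hline
k \approx a + (b + c) & \mathsf{eval}
\end{array}
$$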


This process would repeat itself ad infinitum, as the initial clause is deleted and replaced by an identical clause. Evaluation would violate the side-condition that should have prevented this if we did not insist on the simplification step being successful for the rule to be applied.

In most cases this inference rule is a true simplification with respect to our simplification ordering, since we eliminate at least one symbol in each of the cases of the simplification step. Because normalisation sometimes generates bigger terms, as in the case $x + x \Rightarrow 1 \cdot x + 1 \cdot x \Rightarrow 2 \cdot x$, we sometimes violate the simplification ordering. Since these cases do not occur too frequently, and completeness is not possible in our base theories, we ignore these violations.

During experimentation, we discovered many cases where a unary minus blocks our evaluation rule. Consider the following desired derivation

$$
\begin{array}{c}
y + t \not\approx x \lor C[y + -x] \\ \hline
C[y + -(y + t)] \\ \hline
C[y + (-y + -t)] \\ \hline
C[-t]
\end{array}
$$

This is not currently possible as the weight of $-y + -t$ is 5, which is larger than the weight of $-(y + t)$, meaning the second step is not a simplification.

We introduce a simple fix by modifying the weight function and symbol precedence of the Knuth-Bendix ordering as follows:


1. We let $-$ have weight 0 (for every sorted version of $-$)
2. We let $-$ be the largest symbol among symbols of its sort


As a result we can use the following rewrite rule as an additional simplification rule, since the right hand side has the same weight as the left hand side, but $-$, the outermost symbol on the left hand side, has higher precedence than $+$, the outermost symbol on the right hand side.

$$-(x + y) \Rightarrow_{\mathsf{push}^{-}} (-x) + (-y)$$


174 G. Reger et al. 


6 Cancellation 


The motivation for our last rule was two-fold. Firstly, evaluation of constant predicates can be helpful in some cases, but fails in seemingly trivial cases. One example is the redundant literal $4x + 3 < 4x + 10$. The simple approach of evaluating interpreted predicates fails since we are dealing with non-ground terms. However, the literal can be simplified to a ground literal that can then be evaluated, by cancelling away the $4x$ on both sides of the inequality.

The second motivation was cases where unification with abstraction yields literals in which gve could almost be applied but which require a step of cancellation. An example for such a case is the derivation


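$$
\begin{array}{c}
p(5x) \qquad \lnot p(3x) \lor C[x] \\ \hline
3x \not\approx 5x \lor C[x] \\ \hline
0 \not\approx 2x \lor C[x] \\ \hline
C[0]
\end{array}
$$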


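In order to resolve both of these cases we propose the inference rule cancellation ($\mathsf{cancel}$), which consists of the following two symmetric cases depending on which side is cancelled.

$$\frac{s + \ldots + n t + \ldots + u \mathrel{\diamond} v + \ldots + m t + \ldots + w \lor C}{s + \ldots + u \mathrel{\diamond} v + \ldots + (m - n) t + \ldots + w \lor C}\ \mathsf{cancel}$$

where

– $m - n < n - m$
– $\diamond$ is an equality, disequality, or inequality symbol

$$\frac{s + \ldots + n t + \ldots + u \mathrel{\diamond} v + \ldots + m t + \ldots + w \lor C}{s + \ldots + (n - m) t + \ldots + u \mathrel{\diamond} v + \ldots + w \lor C}\ \mathsf{cancel}$$

where

– $n - m < m - n$
– $\diamond$ is an equality, disequality, or inequality symbol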


In order for the rule to not be sensitive to associativity and commutativity, we 
perform the same steps of normalisation and denormalisation as for the rule eval. 
Again we will only simplify a clause, if cancellation itself, not only normalisation 
and denormalisation, is applicable. 
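
The effect of cancellation on the two motivating examples can be sketched in Python. The sketch assumes both sides are already normalised into dictionaries from a term name to its numeral coefficient, with the key `""` holding the constant summand; it always keeps the positive remainder, which is a simplifying assumption and not necessarily the side condition of the rule. The function names are hypothetical.

```python
import operator

def cancel(lhs, rhs):
    """Cancel summands n*t (left) against m*t (right), keeping the remainder on one side."""
    lhs, rhs = dict(lhs), dict(rhs)
    for t in sorted((set(lhs) & set(rhs)) - {""}):   # constants are left to evaluation
        n, m = lhs.pop(t), rhs.pop(t)
        if m > n:
            rhs[t] = m - n
        elif n > m:
            lhs[t] = n - m
    return lhs, rhs

def evaluate_ground(lhs, rhs, pred):
    """Evaluate pred if both sides consist of constants only; otherwise return None."""
    if (set(lhs) | set(rhs)) <= {""}:
        return pred(lhs.get("", 0), rhs.get("", 0))
    return None

# 4x + 3 < 4x + 10 becomes 3 < 10, which evaluates to True.
l, r = cancel({"x": 4, "": 3}, {"x": 4, "": 10})
print(l, r, evaluate_ground(l, r, operator.lt))

# 3x and 5x become 0 and 2x, after which gve can isolate x.
print(cancel({"x": 3}, {"x": 5}))
```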

The rule is a simplification rule since the number of symbols is reduced with (almost) every application of the cancellation.

Table 1. Comparison of the number of problems solved with any configuration where a new option is enabled to the ones where it is disabled, with a runtime of 10 seconds. The column “both” lists how many were solved in both cases. The columns “on” and “off” list how many additional problems could be solved with the option enabled or disabled, respectively.

          on    both   off
gve      121    3372   104
eval     323    2927   347
asg      440    2859   298
push−    112    3378   107
cancel   576    2749   272


7 Experimental evaluation 


We describe two experiments to establish the impact of the new rules. The first 
experiment compares the new rules to each other, whilst the second experiment 
aims to determine how helpful the new rules will be in designing extensions to 
VAMPIRE’s portfolio mode. This is a standard approach to evaluating the benefit 
of new features in an automated theorem prover [18]. 


Experimental Setup. We implemented the rules as immediate simplification rules in VAMPIRE 4.5 (the implementation is available from the GitHub repository linked from the VAMPIRE website [1], on the branch integer-arithemtic). We selected a suitable subset of problems as follows. We started with the set of 56,210 problems from SMT-LIB that involve quantifiers and arithmetic. In a first step we filtered out benchmarks that VAMPIRE could solve within 1 second in both default mode (which involves a simpler version of the rule eval), and in default mode with eval enabled. Our main experiments were carried out on the remaining set of 21,512 benchmarks, which we will refer to as B. Filtering out trivial benchmarks avoids the results containing noise from benchmarks that can easily be solved and is an approach recently adopted by SMT-COMP [22]. Experiments are run on a Linux cluster where each node contains two octa-core 2.1 GHz Intel Xeon processors and 160GB of RAM. The raw results of our experiments can be found on GitHub.³


Experiment 1. In our first experiment we wanted to find out which are the best 
combinations of new rules, and whether the rules themselves have a positive im- 
pact on proof search. Therefore we ran VAMPIRE in each of the 32 configurations 
C resulting from enabling or disabling each of the 5 groups of rules (asg, gve, 
eval, push, and cancel) over B with a timeout of 10 seconds. 

The results are given in Table 1, showing the total number of problems solved and the problems gained/lost when compared to the default mode with no options set. Each row represents the combination (union) of 16 strategies where that option is turned on. This shows that, with the exception of evaluation, the gains outweigh the losses, sometimes considerably. This result for evaluation tells us that the other rules can still operate effectively without our new evaluation and, further, that the two evaluation methods are in some sense complementary. Therefore, whilst we explore this further, we will keep both evaluation methods. The most significant gains are with cancellation, which may be related to the fact that it is applicable to inequalities as well as equalities.


3 https://github.com/vprover/vampire_publications/tree/master/experimental_data/TACAS-2021-THEORY-REASONING


Table 2. The top 10 strategies in the greedy ranking of configurations.

solved   id   enabled rules
  2546   15   eval, gve, asg, push−, cancel
   548   24   gve
   136   27
    63   22
    51    9
    38    4
    27   23
    20   26
    19   25
    18    5

(The check marks for the configurations with ids 27, 22, 9, 4, 23, 26, 25 and 5 are not legible.)


Table 3. The symmetric difference in number of problems solved between the three new strategies in portfolio mode against VAMPIRE 4.5. Each cell indicates the number of problems solved by the row solver unsolved by the column solver. The column unique lists how many problems each strategy could solve that no other strategy could. The strategy VAMPIRE* is what we can solve with any of the three other strategies. VAMPIRE* is not taken into account for uniqueness.

strategy      total  unique  VAMPIRE*    15    24    27  VAMPIRE 4.5
VAMPIRE*       7511       0         0   622   937   932         1052
15             6889      64         0     0   865   729          824
24             6574      12         0   550     0   261          366
27             6579       2         0   419   266     0          165
VAMPIRE 4.5    6506      10        47   441   298    92            0


Greedy Ranking. Another way of looking at the results of Experiment 1 is to create a greedy ranking of all configurations C: starting with the set of all configurations, we rank the configuration solving the most benchmarks in B as the best, the one that solves the most of the remaining benchmarks as second, and so on. The top 10 strategies in this ranking are given in Table 2. The overall best strategy uses all 5 of the new rules. Interestingly, the second best strategy only uses the gve rule. This ranking indicates the most promising strategies to use in our next experiment.
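
The greedy ranking procedure can be sketched in a few lines of Python. The input format (a mapping from configuration id to the set of benchmarks it solves), the tie-breaking by smallest id, and the toy data are assumptions made for illustration; they are not the experimental data.

```python
def greedy_ranking(solved_by):
    """Rank configurations by how many not-yet-covered benchmarks each one solves."""
    remaining = {c: set(s) for c, s in solved_by.items()}
    ranking = []
    while remaining:
        # Ties are broken in favour of the smallest configuration id.
        best = max(sorted(remaining), key=lambda c: len(remaining[c]))
        gained = remaining.pop(best)
        if not gained:
            break                      # nothing new can be solved any more
        ranking.append((best, len(gained)))
        for solved in remaining.values():
            solved -= gained           # only count benchmarks that are still unsolved
    return ranking

# Hypothetical toy data: three configurations over benchmarks 1..6.
toy = {15: {1, 2, 3, 4}, 24: {3, 4, 5}, 27: {1, 5, 6}}
print(greedy_ranking(toy))
```

Each entry of the result pairs a configuration with the number of additional benchmarks it contributes, which corresponds to the "solved" column of a table such as Table 2.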


Experiment 2. In our second experiment we wanted to see how many new problems we can solve with the new simplification rules compared to our current best effort in VAMPIRE 4.5. Therefore we ran VAMPIRE with the three top ranking configurations of Experiment 1 forced on top of VAMPIRE's portfolio mode. The portfolio mode executes a sequence of strategies heuristically chosen based on problem features. Forcing a configuration of new options on top of this forces each strategy to make use of the new options. We ran this experiment over B with a timeout of 200 seconds.


Table 4. Comparing our new approach, VAMPIRE*, against VAMPIRE 4.5, Cvc4, and Z3 with results separated by logic. The notation (+a, -b) means that the solver solved a problems the new VAMPIRE could not solve, and the new VAMPIRE could solve b problems the other solver could not. The entries a (b) in the column VAMPIRE* list the number a of problems that could be solved by our new rules, and the number b of these problems that could not be solved by any of the other solvers.

logic       count   VAMPIRE*    VAMPIRE 4.5          Cvc4                   Z3
ALIA           24   14 (0)      12 (+0, -2)          23 (+9, -0)            24 (+10, -0)
AUFDTLIA      134   39 (0)      39 (+0, -0)          86 (+47, -0)           80 (+45, -4)
AUFLIA        862   312 (4)     311 (+4, -5)         295 (+84, -101)        331 (+148, -129)
AUFLIRA      1697   1364 (0)    1354 (+0, -10)       1455 (+101, -10)       1453 (+102, -13)
AUFNIA          3   0 (0)       0 (+0, -0)           0 (+0, -0)             0 (+0, -0)
AUFNIRA       509   87 (2)      81 (+2, -8)          87 (+20, -20)          63 (+16, -40)
LIA           246   79 (0)      78 (+0, -1)          246 (+167, -0)         230 (+155, -4)
LRA          2043   1013 (41)   365 (+0, -648)       1528 (+635, -120)      1756 (+883, -140)
NIA            11   1 (0)       1 (+0, -0)           9 (+9, -1)             5 (+4, -0)
NRA           101   92 (0)      91 (+0, -1)          72 (+0, -20)           96 (+9, -5)
UFDTLIA       274   120 (4)     115 (+0, -5)         40 (+3, -83)           34 (+1, -87)
UFDTLIRA       33   0 (0)       0 (+0, -0)           33 (+33, -0)           33 (+33, -0)
UFLIA        4833   1924 (23)   1829 (+30, -125)     2314 (+501, -111)      1899 (+315, -340)
UFLRA           7   2 (0)       2 (+0, -0)           2 (+0, -0)             5 (+3, -0)
UFNIA       10735   2463 (16)   2227 (+11, -247)     4928 (+3055, -590)     3858 (+1983, -588)
Any Logic   21512   7510 (90)   6505 (+47, -1052)    11118 (+4664, -1056)   9867 (+3707, -1350)

Results are given in Table 3 and show that the new rules allow VAMPIRE to solve considerably more problems (1052) than it could before whilst losing relatively few (47). The best configuration of options (all five new rules) solves the most, with the other two configurations solving roughly the same. The interesting point here is that they remain complementary, solving a large number of problems uniquely. These are the exact conditions we require for producing a new, powerful portfolio mode. It is likely that performance will improve even further when also considering other option combinations.

Finally, Table 4 compares the number of problems solved by any of the three top strategies, referred to as VAMPIRE*, against VAMPIRE 4.5, Z3 [7] and Cvc4 [M]. Results are further separated by the logic to which the benchmarks belong: A stands for Arrays, UF stands for Uninterpreted Functions, DT stands for Data Types, L stands for Linear, N for Non-linear, I stands for Integers, R stands for Reals, with the final A standing for Arithmetic. Here we notice that the new rules make a considerable impact in the case of pure linear real arithmetic. This is likely due to the fact that asg allows us to fully generalise away most linear terms and gve will be broadly applicable without uninterpreted functions. It is interesting to note that, whilst the new VAMPIRE solves fewer problems than Cvc4 and Z3 overall, it solves many (1056 and 1350, respectively) problems that the other provers do not solve. The most striking result is that we can solve 90 new problems that neither VAMPIRE 4.5 nor either of the state-of-the-art SMT solvers could solve.


8 Conclusion 


We have motivated and introduced five new simplification rules for reasoning 
in the theory of arithmetic within saturation-based first-order theorem provers. 
These rules were implemented within the VAMPIRE theorem prover and demon- 
strated to improve the reasoning power on problems taken from SMT-LIB. It 
remains future work to explore the ideal combinations of these rules and existing 
proof search heuristics. It also remains an open question whether we can design 
an evaluation rule and modified simplification ordering that ensures that every 
evaluation that we want to perform is a true simplification. As demonstrated, 
this is not necessary pragmatically but would be satisfying theoretically.


Abstract. Stability is required for real world controlled systems as it
ensures that those systems can tolerate small, real world perturbations 
around their desired operating states. This paper shows how stability for 
continuous systems modeled by ordinary differential equations (ODEs) 
can be formally verified in differential dynamic logic (dL). The key insight 
is to specify ODE stability by suitably nesting the dynamic modalities of 
dL with first-order logic quantifiers. Elucidating the logical structure of 
stability properties in this way has three key benefits: i) it provides a flex- 
ible means of formally specifying various stability properties of interest, 
ii) it yields rigorous proofs of those stability properties from dL’s axioms 
with dL’s ODE safety and liveness proof principles, and iii) it enables 
formal analysis of the relationships between various stability properties 
which, in turn, inform proofs of those properties. These benefits are put 
into practice through an implementation of stability proofs for several 
examples in KeYmaera X, a hybrid systems theorem prover based on dL. 


Keywords: differential equations, stability, differential dynamic logic 


1 Introduction 


The study of stability has its roots in efforts to understand mechanical systems, 
particularly those arising in celestial mechanics [15,19,30]. Today, it is an im- 
portant part of numerous applications in dynamical systems [34] and control 
theory [14,18]. This paper studies proofs of stability for continuous dynamical 
systems described by ordinary differential equations (ODEs), such as those used 
to model feedback control systems [14,18]. For such systems, ODE stability is 
a key correctness requirement [2] that deserves fully rigorous proofs alongside 
other key properties such as safety and liveness of those ODEs [28,36]. Despite 
this, formal stability verification has received less attention compared to proofs 
of safety and liveness, e.g., through reachability or deductive techniques [8]. 
Stability for a continuous system (or ODEs) requires that i) its system state always stays close to some desired operating state(s) when initially slightly perturbed from those operating state(s), and ii) those perturbations are eventually dissipated so the system returns to a desired operating state. These properties are especially crucial for engineered systems because they must be robust to real world perturbations deviating from idealized system models. Simple pendulums provide canonical examples of stability phenomena: they are always observed to settle in the rest position of Fig. 1 (bottom) after some time regardless of how they are initially released. In contrast, the inverted pendulum in Fig. 1 (top) is theoretically also at a resting position but can only be observed transiently in practice because the slightest real world perturbation will cause the pendulum to fall due to gravity. Stability explains these observations: the resting position is (asymptotically) stable while the inverted position is unstable and requires active control to ensure its stability. Proofs of safety and liveness properties are still required for the inverted pendulum under control, e.g., its controller must never generate unsafe amounts of torque and the pendulum must eventually reach the inverted position. The triumvirate of safety, liveness, and stability is required for holistic correctness of the inverted pendulum controller.

Fig. 1. A pendulum (in green) hung by a rigid rod from a pivot (in black) perturbed from its resting state (bottom) and from its inverted, upright position (top). Perturbed states (with dashed boundaries) are faded out to show the progression of time.

The classical way of distinguishing the aforementioned stability situations is by designing a Lyapunov function [19], i.e., an energy-like auxiliary measure satisfying certain arithmetical conditions [14,18,31] which implies that the auxiliary energy decreases along system trajectories towards local minima at the stable resting state(s), see Fig. 2. Prior approaches [1,12,17,21,33] have emphasized the need to formally verify those arithmetical conditions in order to guarantee that a conjectured Lyapunov function correctly implies stability for a given system.

Fig. 2. A Lyapunov function that decreases along the pendulum trajectory shown in Fig. 1 (bottom).


This paper shows how deductive proofs of ODE stability can be carried out in differential dynamic logic (dL) [25,26,27], a logic for deductive verification of hybrid systems.¹ The key insight is that stability properties can be specified by suitably nesting the dynamic modalities of dL with quantifiers of first-order logic. The resulting specifications are amenable to rigorous proof by combining dL's ODE safety [28] and liveness [36] proof principles with real arithmetic and first-order quantifier reasoning. This makes it possible to syntactically derive stability for a given system from the small set of dL axioms which, in turn, enables trustworthy stability proofs in the KeYmaera X theorem prover for hybrid systems [11,26]. Notably, this approach directly verifies stability specifications, which goes beyond verifying arithmetical conditions that imply those specifications [1,12,17,21,33]. This is crucial for advanced stability notions because those variations generally require subtle twists to the required arithmetical conditions on their Lyapunov functions [14]; proofs of stability specifications alleviate the onus on system designers to correctly pick and check the appropriate conditions for their applications. Section 3 shows how various stability properties for ODE equilibria can be formally specified and proved in dL with Lyapunov function techniques. Section 4 generalizes those stability specifications, yielding unambiguous formal specifications of advanced stability properties from the literature [14,18], along with their derived proof rules. These specifications also provide rigorous insights into the logical relationship between various stability notions, which are used to inform their respective proofs. Section 5 illustrates the practicality of this paper's dL approach through several stability case studies formalized in KeYmaera X.

1 Hybrid systems are mathematical models describing discrete and continuous dynamics, and interactions thereof. This paper's formal understanding of ODE stability is crucial for subsequent investigation of hybrid systems stability [5,13,20].

All omitted definitions and proofs are available in the supplement [35]. 


2 Background: Differential Dynamic Logic 


This section briefly recalls the syntax and semantics of dL, focusing on its con- 
tinuous fragment which has a complete axiomatization for ODE invariants [28]. 
Full presentations of dL, including its discrete fragment, are elsewhere [26,27]. 


Syntax and Semantics. The grammar of dL terms is as follows, where $x \in \mathbb{V}$ is a variable and $c \in \mathbb{Q}$ is a rational constant. These terms are polynomials over $\mathbb{V}$ (extensions with Noetherian functions [28] such as exp, sin, cos are possible):

$$p, q ::= x \mid c \mid p + q \mid p \cdot q$$

The grammar of dL formulas is as follows, where $\sim\ \in \{=, \neq, \geq, >, \leq, <\}$ is a comparison operator and $\alpha$ is a hybrid program:

$$\phi, \psi ::= p \sim q \mid \phi \land \psi \mid \phi \lor \psi \mid \lnot\phi \mid \forall v\,\phi \mid \exists v\,\phi \mid [\alpha]\phi \mid \langle\alpha\rangle\phi$$

This grammar features atomic comparisons ($p \sim q$), propositional connectives ($\lnot, \land, \lor$), first-order quantifiers over the reals ($\forall, \exists$), and the box ($[\alpha]\phi$) and diamond ($\langle\alpha\rangle\phi$) modality formulas which express that all or some runs of hybrid program $\alpha$ satisfy $\phi$, respectively. The modalities $[\cdot], \langle\cdot\rangle$ can be freely nested with first-order and modal connectives, which is crucial for the specification of stability properties in Sections 3 and 4. Formulas not containing the modalities are formulas of first-order real arithmetic and are written as $P, Q, R$.

This paper focuses on the continuous fragment of hybrid programs $\alpha \equiv x' = f(x) \,\&\, Q$, where $x' = f(x)$ is an $n$-dimensional system of ordinary differential equations (ODEs), $x_1' = f_1(x), \ldots, x_n' = f_n(x)$, over variables $x = (x_1, \ldots, x_n)$, the LHS $x_i'$ is the time derivative of $x_i$ and the RHS $f_i(x)$ is a polynomial over variables $x$. The evolution domain constraint $Q$ specifies the set of states in which the ODE is allowed to evolve continuously. When $Q$ is the formula $\mathit{true}$, the ODE is also written as $x' = f(x)$. For $n$-dimensional vectors $x, y$, the dot product is $x \cdot y \overset{\text{def}}{=} \sum_{i=1}^{n} x_i y_i$ and $\|x\|^2 \overset{\text{def}}{=} \sum_{i=1}^{n} x_i^2$ denotes the squared Euclidean norm. Variables $z \in \mathbb{V} \setminus \{x\}$ not occurring on the LHS of ODE $x' = f(x)$ are parameters that remain constant along ODE solutions. The following parametric ODE model of a simple pendulum is used as a running example.


Example 1 (Pendulum model). The ODE $\alpha_p \equiv \theta' = \omega, \omega' = -\frac{g}{L}\sin(\theta) - b\omega$ models a pendulum (illustrated below) suspended from a pivot by a rod of length $L$, where $\theta$ is the angle of displacement, $\omega$ is the angular velocity of the pendulum, and $g > 0$ is the gravitational constant. Parameter $a = \frac{g}{L}$ is a positive scaling constant and parameter $b \geq 0$ is the coefficient of friction for angular velocity. The symbolic parameters $a, b$ make analysis of $\alpha_p$ apply to a range of concrete values, e.g., pendulums that are suspended by a long rod (with large $L$) are modeled by small positive values of $a$, while frictionless pendulums have $b = 0$.

A simplification of $\alpha_p$ is used because stability analyses often concern the behavior of the pendulum near its resting (or inverted) state where $\theta = 0$. For such nearby states with $\theta \approx 0$, the small angle approximation $\sin(\theta) \approx \theta$ yields a linear ODE:²

$$\alpha_l \equiv \theta' = \omega,\ \omega' = -a\theta - b\omega \qquad (1)$$

An inverted pendulum is modeled by a similar ODE (illustrated on the right) under a change of coordinates. Such a pendulum requires an external torque input $u(\theta, \omega)$ to maintain its stability; $u(\theta, \omega)$ is determined and proved correct in Section 5.

$$\alpha_i \equiv \theta' = \omega,\ \omega' = a\theta - b\omega - u(\theta, \omega) \qquad (2)$$


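States $\nu : \mathbb{V} \to \mathbb{R}$ assign real values to each variable in $\mathbb{V}$; the set of all states is $\mathbb{S}$. The semantics of dL formula $\phi$ is the set of states $[\![\phi]\!] \subseteq \mathbb{S}$ in which $\phi$ is true [26,27], where the semantics of first-order logical connectives are defined as usual, e.g., $[\![\phi \land \psi]\!] = [\![\phi]\!] \cap [\![\psi]\!]$. For ODEs, the semantics of the modal operators is as follows.³ Let $\nu \in \mathbb{S}$ and $\varphi : [0, T) \to \mathbb{S}$ for some $0 < T \leq \infty$, be the unique, right-maximal solution [6] to ODE $x' = f(x)$ with initial value $\varphi(0) = \nu$:

$\nu \in [\![[x' = f(x) \,\&\, Q]\phi]\!]$ iff for all $0 \leq \tau < T$ where $\varphi(\zeta) \in [\![Q]\!]$ for all $0 \leq \zeta \leq \tau$: $\varphi(\tau) \in [\![\phi]\!]$

$\nu \in [\![\langle x' = f(x) \,\&\, Q\rangle\phi]\!]$ iff there exists $0 \leq \tau < T$ such that: $\varphi(\tau) \in [\![\phi]\!]$ and $\varphi(\zeta) \in [\![Q]\!]$ for all $0 \leq \zeta \leq \tau$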
For a formula $P$ the $\varepsilon$-neighborhood of $P$ with respect to $x$ is defined as $\mathcal{U}_\varepsilon(P) \overset{\text{def}}{\equiv} \exists y\,(\|x - y\|^2 < \varepsilon^2 \land P(y))$, where the existentially quantified variables $y$ are fresh in $P$. The neighborhood formula $\mathcal{U}_\varepsilon(P)$ characterizes the set of states within distance $\varepsilon$ from $P$, with respect to the dynamically evolving variables $x$. This is useful for syntactically expressing small $\varepsilon$ perturbations in the stability definitions of Sections 3 and 4. For formulas $P$ of first-order real arithmetic, the $\varepsilon$-neighborhood, $\mathcal{U}_\varepsilon(P)$, can be equivalently expressed in quantifier-free form by quantifier elimination [4]. For example, $\mathcal{U}_\varepsilon(x = 0)$ is equivalent to the formula $\|x\|^2 < \varepsilon^2$. Formulas $\overline{P}$ and $\partial P$ are the syntactically definable topological closure and boundary of the set characterized by $P$, respectively [4].


2 This linearization is justified by the Hartman-Grobman theorem [6]. A nonlinear polynomial approximation, such as $\sin(\theta) \approx \theta - \frac{\theta^3}{6}$, can also be used.

3 The semantics of dL formulas is defined compositionally elsewhere [26,27].


Proof Calculus. All derivations and proof rules are presented in a classical sequent calculus. The semantics of sequent $\Gamma \vdash \phi$ is equivalent to the formula $(\bigwedge_{\psi \in \Gamma} \psi) \to \phi$. A sequent is valid iff its corresponding formula is valid. Completed branches in a sequent proof are marked with $*$. Assumptions $\psi \in \Gamma$ that have only ODE parameters as free variables remain true along ODE evolutions and are soundly kept across ODE deduction steps [26,27]. First-order real arithmetic is decidable [4] so we assume such a decision procedure and label proof steps with $\mathbb{R}$ when they follow from real arithmetic. Axioms and proof rules are derivable iff they can be deduced from sound dL axioms and proof rules [26,27].

Formula $I$ is an invariant of the ODE $x' = f(x) \,\&\, Q$ iff the formula $I \to [x' = f(x) \,\&\, Q]I$ is valid. The dL proof calculus is complete for ODE invariants [28], i.e., any true ODE invariant expressible in first-order real arithmetic can be proved in the calculus. The calculus also supports refinement reasoning [36] for proving ODE liveness properties $P \vdash \langle x' = f(x) \,\&\, Q\rangle R$, which says that the goal $R$ is reached along the ODE $x' = f(x) \,\&\, Q$ from precondition $P$.


An important syntactic tool for reasoning with ODE $x' = f(x)$ is the Lie
derivative of term $p$, defined as $\dot{p} \overset{\text{def}}{=} \sum_{x_i \in x} \frac{\partial p}{\partial x_i} f_i(x)$, whose semantic value is
equal to the time derivative of the value of $p$ along solutions $\varphi$ of the ODE [26,28].
Lie derivatives are provably definable in dL using syntactic differentials [26].
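The Lie derivative formula can be illustrated numerically. The following is a minimal sketch, not part of dL or KeYmaera X: it approximates the partial derivatives by central differences, and the two ODEs (`rotation`, `contraction`) and the term `squared_norm` are hypothetical examples chosen only for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def lie_derivative(p: Callable[[Vector], float],
                   f: Callable[[Vector], Sequence[float]],
                   x: Vector,
                   h: float = 1e-6) -> float:
    """Approximate sum_i (dp/dx_i)(x) * f_i(x) using central differences."""
    fx = f(x)
    total = 0.0
    for i in range(len(x)):
        up: List[float] = list(x)
        down: List[float] = list(x)
        up[i] += h
        down[i] -= h
        dp_dxi = (p(up) - p(down)) / (2.0 * h)  # numerical partial derivative
        total += dp_dxi * fx[i]
    return total


def squared_norm(x: Vector) -> float:
    # Term p = x1^2 + x2^2.
    return x[0] ** 2 + x[1] ** 2


def rotation(x: Vector) -> List[float]:
    # Hypothetical ODE x1' = x2, x2' = -x1 (circular motion).
    return [x[1], -x[0]]


def contraction(x: Vector) -> List[float]:
    # Hypothetical ODE x1' = -x1, x2' = -x2 (decay towards the origin).
    return [-x[0], -x[1]]


point = [0.3, -0.7]
print(lie_derivative(squared_norm, rotation, point))     # approximately 0
print(lie_derivative(squared_norm, contraction, point))  # approximately -1.16
```

The squared norm is conserved along the rotation (Lie derivative zero) and strictly decreases along the contraction (Lie derivative $-2\|x\|^2$), which matches the symbolic computation by hand.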


3 Asymptotic Stability of an Equilibrium Point 


This section presents Lyapunov’s classical notion of asymptotic stability [19] 
and its formal specification in dL. This formalization enables the derivation of 
dL stability proof rules with Lyapunov functions [14,18,19,31]. Several related 
stability concepts are formalized in dL, along with their relationships and rules. 


3.1 Mathematical Preliminaries 


An equilibrium point of ODE $x' = f(x)$ is a point $x_0 \in \mathbb{R}^n$ where $f(x_0) = 0$, so a
system that starts at $x_0$ stays at $x_0$ along its continuous evolution. Such points
are often interesting in real-world systems, e.g., the equilibrium point $\theta = 0, \omega = 0$
for $\alpha_l$ from (1) is the resting state of a pendulum. For a controlled system,
equilibrium points often correspond to desired steady system states where no
further continuous control input (modeled as part of $f(x)$) is required [18].

For brevity, assume the origin $0 \in \mathbb{R}^n$ is an equilibrium point of interest. Any
other equilibrium point(s) of interest $x_0 \in \mathbb{R}^n$ can be translated to the origin
with the change of coordinates $x \mapsto x - x_0$ for the ODE (see supplement [35]).

Fig. 3. Solutions from points in the $\delta$ ball around the origin, like the green initial point
$x$, remain within the $\varepsilon$ ball around the origin $0 \in \mathbb{R}^n$ (black dot) and asymptotically
approach the origin. The latter two plots illustrate how asymptotic stability for an ODE
can be broken down into a pair of (quantified) ODE safety and liveness properties.


The following definition of asymptotic stability is standard [14,18,31].⁴


Definition 2 (Asymptotic stability [14,18,31]). The origin $0 \in \mathbb{R}^n$ of ODE
$x' = f(x)$ is
- stable if, for all $\varepsilon > 0$, there exists $\delta > 0$ such that for all initial states
$x = x(0)$ with $\|x\| < \delta$, the right-maximal ODE solution $x(t) : [0, T) \to \mathbb{R}^n$
satisfies $\|x(t)\| < \varepsilon$ for all times $0 \le t < T$,
- attractive if there exists $\delta > 0$ such that for all $x = x(0)$ with $\|x\| < \delta$, the
right-maximal ODE solution $x(t) : [0, T) \to \mathbb{R}^n$ satisfies $\lim_{t \to T} x(t) = 0$,
- asymptotically stable if it is stable and attractive.


These definitions can be understood using the resting state of the pendulum
from Fig. 1 (bottom) which is asymptotically stable. When the pendulum is given
a light push from its bottom resting state (formally, $\|x\| < \delta$), it gently oscillates
near that resting state (formally, $\|x(t)\| < \varepsilon$). In the presence of friction, these
oscillations eventually dissipate so the pendulum asymptotically returns to its
resting state (formally, $\lim_{t \to T} x(t) = 0$). This behavior is local, i.e., for any given
$\varepsilon > 0$, there exists a sufficiently small $\delta > 0$ perturbation of the initial state that
results in gentle oscillations with $\|x(t)\| < \varepsilon$, see Fig. 3 (left). A strong push,
e.g., with $\delta > \varepsilon$, could instead cause the pendulum to spin around on its pivot.


Remark 3. Stability and attractivity do not imply each other [31, Chapter 1.2.7].
However, if the origin is stable, attractivity can be defined in a simpler way. This 
is proved in dL, after characterizing stability and attractivity syntactically. 


3.2 Formal Specification 


The formal specification of asymptotic stability in dL combines i) the dynamic 
modalities of dL, which are used to quantify over the dynamics of the ODE, and 
ii) the first-order logic quantifiers, which are used to express combinations of 
(topologically) local and asymptotic properties of those dynamics. 


4 Some definitions require, or implicitly assume, right-maximal solutions $x(t)$ to be
global, i.e., with $T = \infty$, see [18, Definition 4.1] and associated discussion. The
definitions presented here are better suited for subsequent generalizations.

Lemma 4 (Asymptotic stability in dL). The origin of ODE $x' = f(x)$ is,
respectively, i) stable, ii) attractive, and iii) asymptotically stable iff the
dL formulas i) $\mathrm{Stab}(x' = f(x))$, ii) $\mathrm{Attr}(x' = f(x))$, and iii) $\mathrm{AStab}(x' = f(x))$
respectively are valid. Variables $\varepsilon, \delta$ are fresh, i.e., not in $x, f(x)$.
\begin{align*}
\mathrm{Stab}(x' = f(x)) &\equiv \forall \varepsilon{>}0\, \exists \delta{>}0\, \forall x\, \bigl(\mathcal{U}_\delta(x = 0) \to [x' = f(x)]\,\mathcal{U}_\varepsilon(x = 0)\bigr) \\
\mathrm{Attr}(x' = f(x)) &\equiv \exists \delta{>}0\, \forall x\, \bigl(\mathcal{U}_\delta(x = 0) \to \mathrm{Asym}(x' = f(x), x = 0)\bigr) \\
\mathrm{AStab}(x' = f(x)) &\equiv \mathrm{Stab}(x' = f(x)) \land \mathrm{Attr}(x' = f(x))
\end{align*}
Formula $\mathrm{Asym}(x' = f(x), P) \equiv \forall \varepsilon{>}0\, \langle x' = f(x)\rangle [x' = f(x)]\,\mathcal{U}_\varepsilon(P)$
characterizes the set of states that asymptotically approach $P$ along ODE solutions.


Formula $\mathrm{Stab}(x' = f(x))$ is a syntactic dL rendering of the corresponding
quantifiers from Def. 2. The safety property $\mathcal{U}_\delta(x = 0) \to [x' = f(x)]\,\mathcal{U}_\varepsilon(x = 0)$
expresses that solutions starting from the $\delta$-neighborhood of the origin always
(for all times) stay safely in the $\varepsilon$-neighborhood, as visualized in Fig. 3 (middle).

Formula $\mathrm{Attr}(x' = f(x))$ uses the subformula $\mathrm{Asym}(x' = f(x), x = 0)$ which
characterizes the limit in Def. 2. Recall $\lim_{t \to T} x(t) = 0$ iff for all $\varepsilon > 0$ there
exists a time $\tau$ with $0 \le \tau < T$ such that for all times $t$ with $\tau \le t < T$,
the solution satisfies $\|x(t)\| < \varepsilon$, i.e., the limit requires for all distances $\varepsilon > 0$,
the ODE solution will eventually always be within distance $\varepsilon$ of the origin, as
visualized in Fig. 3 (right). This limit is characterized using nested $\langle\cdot\rangle[\cdot]$ modalities,
together with first-order quantification according to Def. 2. More generally,
formula $\mathrm{Asym}(x' = f(x), P)$ characterizes the set of initial states where the
right-maximal ODE solution asymptotically approaches $P$; this set is known as
the region of attraction of $P$ [18]. Thus, attractivity requires that the region of
attraction of the origin contains an open neighborhood $\mathcal{U}_\delta(x = 0)$ of the origin.
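The "always within $\varepsilon$" (safety) and "eventually always within $\varepsilon$" (liveness) readings can be illustrated on a sampled trajectory. The sketch below is only numerical evidence on a finite time horizon, not a proof; it assumes the linearized pendulum has the form $\theta' = \omega, \omega' = -a\theta - b\omega$ with $a = b = 1$, and all function names are hypothetical helpers.

```python
import math
from typing import Callable, List, Optional, Sequence

State = List[float]


def rk4_step(f: Callable[[Sequence[float]], State], x: State, dt: float) -> State:
    """One classical Runge-Kutta step for the autonomous ODE x' = f(x)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)])
    return [xi + dt * (a + 2 * b + 2 * c + d) / 6.0
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate(f: Callable[[Sequence[float]], State], x0: Sequence[float],
             dt: float, steps: int) -> List[State]:
    """Sample the solution from x0 at times 0, dt, 2*dt, ..., steps*dt."""
    trace = [list(x0)]
    for _ in range(steps):
        trace.append(rk4_step(f, trace[-1], dt))
    return trace


def norm(x: Sequence[float]) -> float:
    return math.sqrt(sum(xi * xi for xi in x))


def always_within(trace: List[State], eps: float) -> bool:
    """Sampled analogue of the box modality: every sample has norm < eps."""
    return all(norm(x) < eps for x in trace)


def settling_index(trace: List[State], eps: float) -> Optional[int]:
    """Sampled analogue of diamond-box: first index of the final stretch
    of samples that all have norm < eps, or None if the last sample fails."""
    idx = None
    for i in range(len(trace) - 1, -1, -1):
        if norm(trace[i]) < eps:
            idx = i
        else:
            break
    return idx


def pendulum(x: Sequence[float]) -> State:
    # Assumed linearized pendulum with a = b = 1.
    theta, omega = x
    return [omega, -theta - omega]


trace = simulate(pendulum, [0.4, 0.0], dt=0.01, steps=4000)
print(always_within(trace, eps=1.0))
print(settling_index(trace, eps=1e-3) is not None)
```

A simulation like this checks a single initial state over a bounded horizon, whereas the dL formulas quantify over all initial states in the neighborhood and over all times.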

From Lemma 4, proving validity of the formula $\mathrm{AStab}(x' = f(x))$ yields a
rigorous proof of asymptotic stability for $x' = f(x)$. However, if the origin is
stable, then attractivity can be provably simplified with the following corollary.


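Corollary 5 (Stable attractivity). The following axiom is derivable in dL.
\[
\text{SAttr}\quad \mathrm{Stab}(x' = f(x)) \to \bigl(\mathrm{Asym}(x' = f(x), x = 0) \leftrightarrow \forall \varepsilon{>}0\, \langle x' = f(x)\rangle\, \mathcal{U}_\varepsilon(x = 0)\bigr)
\]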


Corollary 5 simplifies the syntactic characterization of the region of attraction
for stable equilibria from a nested $\langle\cdot\rangle[\cdot]$ formula to a $\langle\cdot\rangle$ formula, which is
then directly amenable to ODE liveness reasoning [36]. This corollary is used to
simplify proofs of asymptotic stability, as explained next.


3.3 Lyapunov Functions 


Lyapunov functions are the standard tool for showing stability of general,
nonlinear ODEs [14,18,31] and finding suitable Lyapunov functions is an important
problem in its own right [1,9,12,17,21,23,24,33,37]. This section shows how a
candidate Lyapunov function, once found, can be used to rigorously prove
stability. The following proof rules derive Lyapunov stability arguments [14,18,31]
syntactically in dL.

Lemma 6 (Lyapunov functions). The following Lyapunov function proof
rules are derivable in dL.
\[
\text{Lyap}_{\ge}\ \frac{\vdash f(0) = 0 \land v(0) = 0 \qquad \vdash \exists \gamma{>}0\, \forall x\, \bigl(0 < \|x\|^2 < \gamma^2 \to v > 0 \land \dot{v} \le 0\bigr)}{\vdash \mathrm{Stab}(x' = f(x))}
\]
\[
\text{Lyap}_{>}\ \frac{\vdash f(0) = 0 \land v(0) = 0 \qquad \vdash \exists \gamma{>}0\, \forall x\, \bigl(0 < \|x\|^2 < \gamma^2 \to v > 0 \land \dot{v} < 0\bigr)}{\vdash \mathrm{AStab}(x' = f(x))}
\]


Rules $\text{Lyap}_{\ge}$, $\text{Lyap}_{>}$ use the Lyapunov function $v$ as an auxiliary,
energy-like function near the origin which is positive and has non-positive (resp.
negative, for $\text{Lyap}_{>}$) derivative $\dot{v}$. This guarantees that $v$ is non-increasing (resp.
decreasing) along ODE solutions near the origin, see Fig. 2. The right premise of both
rules uses $\exists \gamma{>}0\, \forall x\, (0 < \|x\|^2 < \gamma^2 \to \cdots)$ to require that the Lyapunov function
conditions are true in a $\gamma$-neighborhood of the origin. The subtle difference in
sign condition for $\dot{v}$ between rules $\text{Lyap}_{\ge}$, $\text{Lyap}_{>}$ is illustrated for the pendulum.


Example 7 (Pendulum asymptotic stability). For ODE $\alpha_l$ from (1), a suitable
Lyapunov function for proving its stability [18] is $v = a\frac{\theta^2}{2} + \frac{(b\theta + \omega)^2 + \omega^2}{4}$, where
the Lie derivative of $v$ along $\alpha_l$ is $\dot{v} = -\frac{b}{2}(a\theta^2 + \omega^2)$. Stability⁵ is formally proved
in dL for any parameter values $a > 0, b \ge 0$ using rule $\text{Lyap}_{\ge}$ because both of
its resulting arithmetical premises are provable by $\mathbb{R}$. The full dL derivation, also
used in KeYmaera X (Section 5), is shown in the proof of Lemma 6 [35].

When $b > 0$, i.e., friction is non-negligible, an identical derivation with $\text{Lyap}_{>}$
instead of $\text{Lyap}_{\ge}$ proves asymptotic stability because $-\frac{b}{2}(a\theta^2 + \omega^2)$ is negative
except at the origin. Indeed, displacements to the pendulum's resting state can
only be dissipated in the presence of friction, not when $b = 0$.
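The arithmetic behind this example can be spot-checked with exact rational arithmetic. The sketch below is an illustration only, not the dL proof: it assumes the linearized pendulum has the form $\theta' = \omega, \omega' = -a\theta - b\omega$ and takes $v = a\theta^2/2 + ((b\theta + \omega)^2 + \omega^2)/4$; the function names and the sample grid are hypothetical. It confirms on sample points that the Lie derivative equals $-\frac{b}{2}(a\theta^2 + \omega^2)$ and that the sign conditions of the Lyapunov rules hold away from the origin.

```python
from fractions import Fraction as F
from itertools import product


def v(a, b, th, om):
    """Candidate Lyapunov function for the (assumed) linearized pendulum."""
    return a * th * th / 2 + ((b * th + om) ** 2 + om * om) / 4


def lie_v(a, b, th, om):
    """Gradient of v dotted with the vector field (om, -a*th - b*om)."""
    dv_dth = a * th + b * (b * th + om) / 2
    dv_dom = (b * th + om) / 2 + om / 2
    return dv_dth * om + dv_dom * (-a * th - b * om)


def closed_form(a, b, th, om):
    """Claimed closed form of the Lie derivative."""
    return -b * (a * th * th + om * om) / 2


def check_premises(a, b, grid, strict):
    """Check v > 0 and v' <= 0 (or v' < 0 if strict) at all nonzero grid points,
    and that lie_v agrees exactly with the closed form everywhere on the grid."""
    for th, om in product(grid, repeat=2):
        if lie_v(a, b, th, om) != closed_form(a, b, th, om):
            return False
        if th == 0 and om == 0:
            continue
        if not v(a, b, th, om) > 0:
            return False
        d = lie_v(a, b, th, om)
        if (strict and not d < 0) or (not strict and not d <= 0):
            return False
    return True


grid = [F(k, 4) for k in range(-8, 9)]
print(check_premises(F(3, 2), F(1, 3), grid, strict=True))   # friction b > 0
print(check_premises(F(3, 2), F(0), grid, strict=False))     # no friction
print(check_premises(F(3, 2), F(0), grid, strict=True))      # strict sign fails when b = 0
```

The last call fails the strict check because the Lie derivative is identically zero when $b = 0$, which mirrors why only the non-strict rule applies without friction.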


3.4 Asymptotic Stability Variations 


Asymptotic stability is a strong guarantee about the local behavior of ODE 
solutions near equilibrium points of interest. In certain applications, stronger 
stability guarantees may be needed for those equilibria [18]. This section exam- 
ines two standard stability variations, shows how they can be proved in dL, and 
formally analyzes their logical relationship with asymptotic stability. 


Exponential stability. As the name suggests, the first stability variation,
exponential stability, guarantees an exponential rate of convergence towards the
equilibrium point from an initial displacement. This is useful, e.g., for bounding
the time spent by a perturbed system far away from its desired operating state.


Definition 8 (Exponential stability [14,18,31]). The origin $0 \in \mathbb{R}^n$ of ODE
$x' = f(x)$ is exponentially stable if there are positive constants $\alpha, \beta, \delta > 0$ such
that for all initial states $x = x(0)$ with $\|x\| < \delta$, the right-maximal ODE solution
$x(t) : [0, T) \to \mathbb{R}^n$ satisfies $\|x(t)\| \le \alpha \|x(0)\| \exp(-\beta t)$ for all times $0 \le t < T$.


5 For the trigonometric pendulum ODE $\alpha_p$ from Example 1, the Lyapunov function
$v = a(1 - \cos(\theta)) + \frac{(b\theta + \omega)^2 + \omega^2}{4}$ with Lie derivative $\dot{v} = -\frac{b}{2}(a\theta \sin(\theta) + \omega^2)$ proves its
stability [18] but requires arithmetic reasoning over trigonometric functions.

Exponential stability bounds the norm of solutions to ODE $x' = f(x)$ near
the origin by a decaying exponential. It is specified in dL as follows.


Lemma 9 (Exponential stability in dL). The origin of ODE $x' = f(x)$ is
exponentially stable iff the following dL formula is valid. Variables $\alpha, \beta, \delta, y$
are fresh, i.e., not in $x, f(x)$.
\begin{align*}
\mathrm{EStab}(x' = f(x)) \equiv{}& \exists \alpha{>}0\, \exists \beta{>}0\, \exists \delta{>}0\, \forall x\, \bigl(\mathcal{U}_\delta(x = 0) \to \\
& [y := \alpha^2 \|x\|^2;\; x' = f(x), y' = -2\beta y]\; \|x\|^2 \le y\bigr)
\end{align*}
The discrete assignment $y := \alpha^2 \|x\|^2$ sets the value of variable $y$ to that of $\alpha^2 \|x\|^2$
and $;$ denotes sequential composition of hybrid programs [26,27].


Formula $\mathrm{EStab}(x' = f(x))$ uses a fresh variable $y$ with ODE $y' = -2\beta y$
and initialized to $\alpha^2 \|x\|^2$ so that $y$ differentially axiomatizes [28] the (squared)
decaying exponential function $\alpha^2 \|x(0)\|^2 \exp(-2\beta t)$ along ODE solutions. Such
an implicit (polynomial) characterization of exponential decay allows syntactic
proof steps to use decidable real arithmetic reasoning.
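The correspondence between the auxiliary variable $y$ and the exponential bound of Definition 8 can be spelled out as follows. This is a worked sketch that uses only the closed-form solution of a scalar linear ODE and the positivity of $\alpha$.

```latex
\begin{align*}
y' = -2\beta y,\quad y(0) = \alpha^2 \|x(0)\|^2
  &\;\Longrightarrow\; y(t) = \alpha^2 \|x(0)\|^2 \exp(-2\beta t), \\
\|x(t)\|^2 \le y(t)
  &\iff \|x(t)\| \le \alpha \|x(0)\| \exp(-\beta t)
  \qquad (\text{square roots of nonnegative quantities, } \alpha > 0).
\end{align*}
```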


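Lemma 10 (Lyapunov function for exponential stability). The following
Lyapunov function proof rule for exponential stability is derivable in dL, where
$k_1, k_2, k_3 \in \mathbb{Q}$ are positive constants.
\[
\text{Lyap}_{E}\ \frac{\vdash \exists \gamma{>}0\, \forall x\, \bigl(\|x\|^2 < \gamma^2 \to k_1^2 \|x\|^2 \le v \le k_2^2 \|x\|^2 \land \dot{v} \le -2 k_3 v\bigr)}{\vdash \mathrm{EStab}(x' = f(x))}
\]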


Rule $\text{Lyap}_{E}$ enables proofs of exponential stability in dL. In fact, the proof
of Lemma 10 (see supplement [35]) yields concrete, quantitative bounds where
$\mathrm{EStab}(x' = f(x))$ is explicitly witnessed with scaling constant $\alpha = \frac{k_2}{k_1}$ and decay
rate $\beta = k_3$. These can be used to calculate time bounds when the system
state will return sufficiently close to the origin. Similarly, the disturbance $\delta$ in
$\mathrm{EStab}(x' = f(x))$ is quantitatively witnessed by $\frac{k_1}{k_2}\gamma$ for any $\gamma$ witnessing validity
of the premise of rule $\text{Lyap}_{E}$. This yields a provable estimate of the region around
the origin where exponential stability holds; this latter estimate is explored next.
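The quantitative witnesses can be turned into a simple time-bound calculation. The following sketch assumes the witnesses have the form $\alpha = k_2/k_1$, $\beta = k_3$, $\delta = (k_1/k_2)\gamma$ (as follows from $k_1^2\|x\|^2 \le v \le k_2^2\|x\|^2$ and $\dot{v} \le -2k_3 v$); the class and function names are hypothetical, and the time bound is obtained by solving $\alpha \|x(0)\| \exp(-\beta t) \le \varepsilon$ for $t$.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpWitness:
    alpha: float  # scaling constant
    beta: float   # decay rate
    delta: float  # admissible size of the initial disturbance


def witnesses(k1: float, k2: float, k3: float, gamma: float) -> ExpWitness:
    """Witnesses derived from the constants in the premise of the rule."""
    if min(k1, k2, k3, gamma) <= 0:
        raise ValueError("all constants must be positive")
    return ExpWitness(alpha=k2 / k1, beta=k3, delta=(k1 / k2) * gamma)


def time_to_reach(w: ExpWitness, x0_norm: float, eps: float) -> float:
    """A time t >= 0 after which alpha * x0_norm * exp(-beta * t) <= eps."""
    if not 0 <= x0_norm < w.delta:
        raise ValueError("initial state must lie in the delta neighborhood")
    if eps <= 0:
        raise ValueError("eps must be positive")
    if w.alpha * x0_norm <= eps:
        return 0.0
    return math.log(w.alpha * x0_norm / eps) / w.beta


# Illustrative constants: k1 = 1/2, k2 = 1, k3 = 1/4, gamma = 1.
w = witnesses(0.5, 1.0, 0.25, 1.0)
print(w)
print(time_to_reach(w, x0_norm=0.4, eps=0.01))
```

The returned time is an upper bound implied by the exponential envelope; actual solutions may enter the $\varepsilon$ ball earlier.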


Region of attraction. Formulas $\mathrm{Attr}(x' = f(x))$ and $\mathrm{EStab}(x' = f(x))$ both
feature a subformula of the form $\exists \delta{>}0\, \forall x\, (\mathcal{U}_\delta(x = 0) \to \cdots)$ which expresses
that attractivity (or exponential stability) is locally true in some $\delta$ neighborhood
of the origin. In many applications, it is useful to find and rigorously prove that
a given set is attractive or exponentially stable with respect to the origin [18,
Chapter 8.2]. The second stability variation yields provable subsets of the region
of attraction, including the special case where it is the entire state space. This is
formalized using the following variants of $\mathrm{Attr}(x' = f(x))$ and $\mathrm{EStab}(x' = f(x))$
within a region given by a formula $P$.
\begin{align*}
\mathrm{Attr}^P(x' = f(x), P) \equiv{}& \forall x\, \bigl(P \to \mathrm{Asym}(x' = f(x), x = 0)\bigr) \\
\mathrm{EStab}^P(x' = f(x), P) \equiv{}& \exists \alpha{>}0\, \exists \beta{>}0\, \forall x\, \bigl(P \to \\
& [y := \alpha^2 \|x\|^2;\; x' = f(x), y' = -2\beta y]\; \|x\|^2 \le y\bigr)
\end{align*}

The formula $\mathrm{Attr}^P(x' = f(x), P)$ is valid iff the set characterized by $P$ is
a subset of the origin's region of attraction [18]. For example, $\mathrm{Attr}(x' = f(x))$
is $\exists \delta{>}0\, \mathrm{Attr}^P(x' = f(x), \mathcal{U}_\delta(x = 0))$. This generalization is useful for
formalizing stronger notions of stability in dL, such as the following global stability
notions [14,18]. For brevity, dL specifications of the stability properties (in bold)
are given below with mathematical definitions deferred to the supplement [35].


Lemma 11 (Global stability in dL). The origin of ODE $x' = f(x)$ is globally
asymptotically stable iff the dL formula $\mathrm{Stab}(x' = f(x)) \land \mathrm{Attr}^P(x' = f(x), \mathit{true})$
is valid. The origin is globally exponentially stable iff the dL formula
$\mathrm{EStab}^P(x' = f(x), \mathit{true})$ is valid.


Global stability ensures that all perturbations to the system state are eventually
dissipated. The corresponding proof rules are similar to $\text{Lyap}_{>}$ and $\text{Lyap}_{E}$ respectively.


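Lemma 12 (Lyapunov function for global stability). The following Lyapunov
function proof rules for global asymptotic and exponential stability are
derivable in dL. In rule $\text{Lyap}^{G}_{E}$, $k_1, k_2, k_3 \in \mathbb{Q}$ are positive constants.
\[
\text{Lyap}^{G}_{>}\ \frac{\vdash f(0) = 0 \land v(0) = 0 \qquad x \ne 0 \vdash v > 0 \land \dot{v} < 0 \qquad \vdash \forall b\, \exists \gamma{>}0\, \forall x\, \bigl(v \le b \to \mathcal{U}_\gamma(x = 0)\bigr)}{\vdash \mathrm{Stab}(x' = f(x)) \land \mathrm{Attr}^P(x' = f(x), \mathit{true})}
\]
\[
\text{Lyap}^{G}_{E}\ \frac{\vdash k_1^2 \|x\|^2 \le v \le k_2^2 \|x\|^2 \land \dot{v} \le -2 k_3 v}{\vdash \mathrm{EStab}^P(x' = f(x), \mathit{true})}
\]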


Example 13 (Pendulum global exponential stability). For simplicity, instantiate
Example 7 with parameters $a = 1, b = 1$. The Lyapunov function then simplifies
to $v = \frac{\theta^2}{2} + \frac{(\theta + \omega)^2 + \omega^2}{4}$ with Lie derivative $\dot{v} = -\frac{\theta^2 + \omega^2}{2}$, which satisfies the real
arithmetic inequalities $\frac{1}{4}(\theta^2 + \omega^2) \le v \le \theta^2 + \omega^2$ and $\dot{v} \le -\frac{1}{2} v$. Thus, rule $\text{Lyap}^{G}_{E}$
proves global exponential stability of $\alpha_l$ with $k_1 = \frac{1}{2}$, $k_2 = 1$, and $k_3 = \frac{1}{4}$. An
important caveat is that Example 7 used a local small angle approximation, so
this global phenomenon does not hold for a real world pendulum (nor for $\alpha_p$).
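The three inequalities of this example can be checked by completing squares. The derivation below is a worked sketch, taking $v = \frac{\theta^2}{2} + \frac{(\theta + \omega)^2 + \omega^2}{4}$ and $\dot{v} = -\frac{\theta^2 + \omega^2}{2}$ for $a = b = 1$ as the starting point; in KeYmaera X such steps would instead be discharged by real arithmetic.

```latex
\begin{align*}
v - \tfrac{1}{4}(\theta^2 + \omega^2)
  &= \tfrac{1}{4}\bigl(\theta^2 + (\theta + \omega)^2\bigr) \ge 0, \\
(\theta^2 + \omega^2) - v
  &= \tfrac{1}{4}\bigl((\theta - \omega)^2 + \omega^2\bigr) \ge 0, \\
\dot{v} + \tfrac{1}{2} v
  &= -\tfrac{1}{2}\bigl((\theta^2 + \omega^2) - v\bigr) \le 0 .
\end{align*}
```

The first two lines give $k_1^2 = \frac{1}{4}$ and $k_2^2 = 1$, and the third gives $2k_3 = \frac{1}{2}$.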


Logical relationships. With the proliferation of stability variations just
introduced, it is useful to take stock of their logical relationships. An important
example of such a relationship is shown in the following corollary.


Corollary 14 (Exponential stability implies asymptotic stability). The 
following axioms are derivable in dL. 


\begin{align*}
\text{EStabStab}\quad & \mathrm{EStab}(x' = f(x)) \to \mathrm{Stab}(x' = f(x)) \\
\text{EStabAttr}\quad & \mathrm{EStab}^P(x' = f(x), P) \to \mathrm{Attr}^P(x' = f(x), P)
\end{align*}

Derived axioms EStabStab, EStabAttr show that exponential stability implies
asymptotic stability. In proofs, EStabAttr allows the region of attraction
to be estimated using the region where solutions are exponentially bounded.

4 General Stability


This section provides stability definitions and proof rules that generalize stability 
for an equilibrium point from Section 3 to the stability of sets. These definitions 
are useful when the desired stable system state(s) is not modeled by a single 
equilibrium point, but may instead, e.g., lie on a periodic trajectory [18], a 
hyperplane, or a continuum of equilibrium points within the state space [14]. 
The generalized definition is used to formalize two stability notions from the 
literature [14,18], and to justify their Lyapunov function proof rules. 


4.1 General Stability and General Attractivity 


The following general stability formula defines stability in dL with respect to an
ODE $x' = f(x)$ and formulas $P, R$. The quantified variables $\varepsilon, \delta$ are assumed to
be fresh by bound renaming, i.e., do not appear in $x, f(x), P$ or $R$.
\[
\mathrm{Stab}^P_R(x' = f(x), P, R) \equiv \forall \varepsilon{>}0\, \exists \delta{>}0\, \forall x\, \bigl(\mathcal{U}_\delta(P) \to [x' = f(x)]\,\mathcal{U}_\varepsilon(R)\bigr)
\]

This formula generalizes stability of the origin $\mathrm{Stab}(x' = f(x))$ by adding two
logical tuning knobs that can be intuitively understood as follows. The
precondition $P$ characterizes the initial states from which the system state is expected
to be disturbed by some disturbance $\delta$. The postcondition $R$ characterizes the
set of desired operating states that the system must remain close to (within the
$\varepsilon$ neighborhood of $R$) after being disturbed from its initial states.

The general attractivity formula similarly generalizes $\mathrm{Attr}^P(x' = f(x), P)$
with a postcondition $R$ towards which the ODE solutions from initial states
satisfying precondition $P$ are asymptotically attracted.
\[
\mathrm{Attr}^P_R(x' = f(x), P, R) \equiv \forall x\, \bigl(P \to \mathrm{Asym}(x' = f(x), R)\bigr)
\]


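Lemma 15 (General Lyapunov functions). The following Lyapunov function
proof rule for general stability with two stacked premises is derivable in dL.
\[
\text{GLyap}\ \frac{
\begin{array}{l}
\vdash P \to R \\[2pt]
\vdash \forall \varepsilon{>}0\; \exists\, 0{<}\gamma{\le}\varepsilon\; \exists k \left(
\begin{array}{l}
\forall x\, \bigl(\partial(\mathcal{U}_\gamma(R)) \to v \ge k\bigr) \;\land \\
\exists\, 0{<}\delta{\le}\gamma\; \forall x\, \bigl(\mathcal{U}_\delta(P) \to R \lor v < k\bigr) \;\land \\
\forall x\, \bigl(R \lor v < k \to [x' = f(x) \,\&\, \overline{\mathcal{U}_\gamma(R)}]\,(R \lor v < k)\bigr)
\end{array}\right)
\end{array}
}{\vdash \mathrm{Stab}^P_R(x' = f(x), P, R)}
\]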


Rule GLyap proves general stability for precondition $P$ and postcondition
$R$. It generalizes the Lyapunov function reasoning underlying rule $\text{Lyap}_{\ge}$ to
support arbitrary pre- and postconditions. The conjunct $\forall x\, (\partial(\mathcal{U}_\gamma(R)) \to v \ge k)$
requires $v \ge k$ on the boundary of $\mathcal{U}_\gamma(R)$ while the middle conjunct requires $v < k$
for some small neighborhood of $P$ excluding $R$. The conjunct $\forall x\, (R \lor v < k \to \cdots)$
asserts that $R \lor v < k$ is an invariant of the ODE within closed domain $\overline{\mathcal{U}_\gamma(R)}$.
When $R$ is a formula of first-order real arithmetic, this invariance question is
provably equivalent in dL to a formula of real arithmetic [28], so the premise
of rule GLyap is, in theory, decidable by $\mathbb{R}$ for a given candidate Lyapunov
function $v$. In practice, it is prudent to consider specialized stability notions, for
which the premise of rule GLyap can be arithmetically simplified. Proof rules
for generalized attractivity are also derivable for specialized instances.


4.2 Specialization 


General stability specializes to several stability notions in the literature. For 
brevity, dL specifications of the stability properties (in bold) are given below 
with mathematical definitions deferred to the supplement [35]. 


Set Stability. An important special case is when the desired operating states
are exactly the states from which disturbances are expected, i.e., $R = P$. This
leads to the notion of set stability of the set characterized by $P$ [14,18].


Lemma 16 (Set Stability in dL). For the ODE $x' = f(x)$, the set characterized
by formula $P$ is i) stable, ii) attractive, iii) asymptotically stable, and
iv) globally asymptotically stable iff the following dL formulas are valid:

i) $\mathrm{Stab}^P_R(x' = f(x), P, P)$,
ii) $\exists \delta{>}0\, \mathrm{Attr}^P_R(x' = f(x), \mathcal{U}_\delta(P), P)$,
iii) $\mathrm{Stab}^P_R(x' = f(x), P, P) \land \exists \delta{>}0\, \mathrm{Attr}^P_R(x' = f(x), \mathcal{U}_\delta(P), P)$, and
iv) $\mathrm{Stab}^P_R(x' = f(x), P, P) \land \mathrm{Attr}^P_R(x' = f(x), \mathit{true}, P)$


The intuition for Lemma 16 is similar to Lemmas 4 and 11, except formula 
P (instead of the origin) characterizes the set of desirable states. An application 
of set stability is shown in the following example. 


Example 17 (Tennis racket theorem [3]). The following system of ODEs models
the rotation of a 3D rigid body [6,14], where $x_1, x_2, x_3$ are angular velocities and
$I_1 > I_2 > I_3 > 0$ are the principal moments of inertia along the respective axes.
\[
\alpha_r \equiv x_1' = \frac{I_2 - I_3}{I_1} x_2 x_3,\quad x_2' = \frac{I_3 - I_1}{I_2} x_3 x_1,\quad x_3' = \frac{I_1 - I_2}{I_3} x_1 x_2
\]

When such a rigid object is spun or rotated on each of its axes, a well-known
physical curiosity [3] is that the rotation is stable in the first and third axes,
whilst additional (unstable) twisting motion is observed for the intermediate
axis. Mathematically, a perfect rotation, e.g., around $x_1$, corresponds to a (large)
initial value for $x_1$ with no rotation in the other axes, i.e., $x_2 = 0, x_3 = 0$.
Accordingly the real world observation of stability for rotations about the first
principal axis is explained by stability with respect to small perturbations in
$x_2, x_3$, as formally specified by formula (3) below. Note that the set characterized
by formula $x_2 = 0 \land x_3 = 0$ is the entire $x_1$ axis, not just a single point. Similarly,
rotations are stable around the third principal axis iff formula (4) is valid.


$\mathrm{Stab}^P_R(\alpha_r,\; x_2 = 0 \land x_3 = 0,\; x_2 = 0 \land x_3 = 0)$ (3)
$\mathrm{Stab}^P_R(\alpha_r,\; x_1 = 0 \land x_2 = 0,\; x_1 = 0 \land x_2 = 0)$ (4)


The validity of formulas (3) and (4) is proved in Example 20.
The formal specification of set stability yields three provable logical consequences
which are important stepping stones for the set stability proof rules.


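Corollary 18 (Set stability properties). The following axioms are derivable
in dL. In axiom SClosure, formula $\overline{P}$ characterizes the topological closure of
formula $P$. In axiom SClosed, formula $P$ characterizes a closed set.
\begin{align*}
\text{SetSAttr}\quad & \mathrm{Stab}^P_R(x' = f(x), P, P) \to \bigl(\mathrm{Asym}(x' = f(x), P) \leftrightarrow \forall \varepsilon{>}0\, \langle x' = f(x)\rangle\, \mathcal{U}_\varepsilon(P)\bigr) \\
\text{SClosure}\quad & \mathrm{Stab}^P_R(x' = f(x), P, P) \leftrightarrow \mathrm{Stab}^P_R(x' = f(x), \overline{P}, \overline{P}) \\
\text{SClosed}\quad & \mathrm{Stab}^P_R(x' = f(x), P, P) \to \forall x\, \bigl(P \to [x' = f(x)]P\bigr)
\end{align*}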


Axiom SetSAttr generalizes SAttr and provides a syntactic simplification of
the region of attraction for formula $P$ when $P$ is stable. Axiom SClosure says
that stability of $P$ is equivalent to stability of its closure $\overline{P}$, because for any
perturbation $\delta > 0$, the neighborhoods $\mathcal{U}_\delta(P)$ and $\mathcal{U}_\delta(\overline{P})$ are provably equivalent
in real arithmetic. Axiom SClosed says that for closed formulas $P$, invariance
of $P$ is a necessary condition for stability of $P$. Without loss of generality, it
suffices to develop proof rules for stability of formulas characterizing closed (using
SClosure) and invariant (using SClosed) sets. Indeed, standard definitions of set
stability [14,18] usually assume that the set of concern is closed and invariant.


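Lemma 19 (Set stability Lyapunov functions). The following Lyapunov
function proof rules for set stability are derivable in dL. In derived rules $\text{SLyap}_{\ge}$
and $\text{SLyap}_{>}$, formula $P$ characterizes a compact (i.e., closed and bounded) set.
In derived rule $\text{SLyap}^{*}_{\ge}$, the two premises are stacked.
\[
\text{SLyap}_{\ge}\ \frac{P \vdash [x' = f(x)]P \qquad \lnot P \vdash v > 0 \land \dot{v} \le 0 \qquad \partial P \vdash v \le 0}{\vdash \mathrm{Stab}^P_R(x' = f(x), P, P)}
\]
\[
\text{SLyap}_{>}\ \frac{P \vdash [x' = f(x)]P \qquad \lnot P \vdash v > 0 \land \dot{v} < 0 \qquad \partial P \vdash v \le 0}{\vdash \mathrm{Stab}^P_R(x' = f(x), P, P) \land \exists \delta{>}0\, \mathrm{Attr}^P_R(x' = f(x), \mathcal{U}_\delta(P), P)}
\]
\[
\text{SLyap}^{*}_{\ge}\ \frac{
\begin{array}{l}
P \vdash [x' = f(x)]P \\[2pt]
\vdash \forall \varepsilon{>}0\; \exists\, 0{<}\gamma{\le}\varepsilon\; \exists k \left(
\begin{array}{l}
\forall x\, \bigl(\partial(\mathcal{U}_\gamma(P)) \to v \ge k\bigr) \;\land \\
\exists\, 0{<}\delta{\le}\gamma\; \forall x\, \bigl(\mathcal{U}_\delta(P) \land \lnot P \to v < k\bigr) \;\land \\
\forall x\, \bigl(\mathcal{U}_\gamma(P) \land \lnot P \to \dot{v} \le 0\bigr)
\end{array}\right)
\end{array}
}{\vdash \mathrm{Stab}^P_R(x' = f(x), P, P)}
\]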


All three proof rules have the necessary premise $P \vdash [x' = f(x)]P$ which says
that formula $P$ is an invariant of the ODE $x' = f(x)$. Rules $\text{SLyap}_{\ge}$, $\text{SLyap}_{>}$
are slight generalizations of Lyapunov function proof rules for set stability [14]
and they respectively generalize rules $\text{Lyap}_{\ge}$, $\text{Lyap}_{>}$ to prove stability for an
invariant $P$. Importantly, both rules assume that $P$ characterizes a compact,
i.e., closed and bounded set, which simplifies the arithmetical conditions on $v$ in
their premises. The rule without the boundedness requirement on $P$ suggested
in the remark after [18, Definition 8.1] is unsound, see supplement [35].
For asymptotic stability (in rule $\text{SLyap}_{>}$), boundedness also guarantees that
perturbed ODE solutions always exist for sufficient duration, which is a
fundamental step in the ODE liveness proofs [36]. Rule $\text{SLyap}^{*}_{\ge}$ is derived from rule
GLyap using invariance of $P$ by the first premise; it provides a means of formally
proving the set stability properties (3) and (4) from Example 17.


Example 20 (Stability of rigid body motion). The proof for (3) uses the Lyapunov
function $v = \frac{I_1 - I_2}{I_3} x_2^2 - \frac{I_3 - I_1}{I_2} x_3^2$, whose Lie derivative is $\dot{v} = 0$, and
rule $\text{SLyap}^{*}_{\ge}$ with formula $P \equiv x_2 = 0 \land x_3 = 0$. The proof for (4) is symmetric.
For the top premise of rule $\text{SLyap}^{*}_{\ge}$, formula $P$ is a provable invariant [28] of
the ODE $\alpha_r$. The bottom premise, although arithmetically complicated, can be
simplified by choosing $\gamma = \varepsilon$ and deciding the resulting formula by $\mathbb{R}$.

Recall that the $x_1$ axis is not a compact set so neither of the standard proof
rules for set stability $\text{SLyap}_{\ge}$, $\text{SLyap}_{>}$ would be sound for this proof.
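The claim that a Lyapunov function of this kind has zero Lie derivative can be spot-checked with exact rational arithmetic. The sketch below rests on two assumptions that should be compared against the paper's definitions: the rigid body ODE is taken to be Euler's equations in the standard form $x_1' = \frac{I_2 - I_3}{I_1} x_2 x_3$, $x_2' = \frac{I_3 - I_1}{I_2} x_3 x_1$, $x_3' = \frac{I_1 - I_2}{I_3} x_1 x_2$, and the candidate is $v = \frac{I_1 - I_2}{I_3} x_2^2 + \frac{I_1 - I_3}{I_2} x_3^2$. All function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product


def euler_rhs(inertia, x):
    """Right-hand side of the (assumed) Euler rigid body equations."""
    i1, i2, i3 = inertia
    x1, x2, x3 = x
    return ((i2 - i3) / i1 * x2 * x3,
            (i3 - i1) / i2 * x3 * x1,
            (i1 - i2) / i3 * x1 * x2)


def v(inertia, x):
    """Candidate conserved quantity, positive whenever (x2, x3) != (0, 0)."""
    i1, i2, i3 = inertia
    _, x2, x3 = x
    return (i1 - i2) / i3 * x2 * x2 + (i1 - i3) / i2 * x3 * x3


def lie_v(inertia, x):
    """Gradient of v dotted with the vector field."""
    i1, i2, i3 = inertia
    _, x2, x3 = x
    grad = (F(0), 2 * (i1 - i2) / i3 * x2, 2 * (i1 - i3) / i2 * x3)
    return sum(g * f for g, f in zip(grad, euler_rhs(inertia, x)))


def conserved_on_grid(inertia, grid):
    """True iff the Lie derivative vanishes exactly at every grid point and
    v is positive at every grid point off the x1 axis."""
    for x in product(grid, repeat=3):
        if lie_v(inertia, x) != 0:
            return False
        if (x[1] != 0 or x[2] != 0) and not v(inertia, x) > 0:
            return False
    return True


inertia = (F(3), F(2), F(1))  # illustrative values with I1 > I2 > I3 > 0
grid = [F(k, 2) for k in range(-4, 5)]
print(conserved_on_grid(inertia, grid))
```

Because $v$ vanishes exactly on the $x_1$ axis and is positive elsewhere, it measures the distance from the axis in a way that is preserved by the dynamics, which is what the stability argument needs.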


Epsilon-Stability. Motivated by numerical robustness of proofs of stability,
Gao et al. [12] define $\varepsilon$-stability for ODEs. The following dL characterization
shows how $\varepsilon$-stability can be understood as an instance of general stability.


Lemma 21 ($\varepsilon$-Stability in dL). The origin of ODE $x' = f(x)$ is $\varepsilon$-stable for
constant $\varepsilon > 0$ iff the dL formula $\mathrm{Stab}^P_R(x' = f(x), x = 0, \mathcal{U}_\varepsilon(x = 0))$ is valid.


Unlike set stability, $\varepsilon$-stability is an instance of general stability where the
pre- and postconditions differ. In $\varepsilon$-stability, systems are perturbed from the
precondition $x = 0$ (the origin), but the postcondition enlarges the set of desired
states to an $\varepsilon > 0$ neighborhood of the origin, which is considered indistinguishable
from the origin itself [12]. An immediate consequence of Lemma 21 is that
rule GLyap can be used to prove $\varepsilon$-stability, as shown in the next section.


5 Stability in KeYmaera X 


This section puts the dL stability specifications and derivations from the preceding
sections into practice through proofs for several case studies in the KeYmaera X
theorem prover [11].⁶ Examples 7, 13, 17, 20 have also been formalized.
The insights from these proofs are discussed after an overview of the case studies.


Inverted Pendulum. The stability of the resting state of the pendulum is
investigated in Examples 7 and 13. For the inverted pendulum $\alpha_i$ from (2), the
controlled torque $u(\theta, \omega)$ must be designed and rigorously proved to ensure feedback
stabilization [18] of the inverted position. A standard PD (Proportional-Derivative)
controller can be used for stabilization, where the control input has
the form $u(\theta, \omega) = k_1 \theta + k_2 \omega$ for tuning parameters $k_1, k_2$. Asymptotic stability
of the inverted position is achieved for any control parameter choice where $k_1 > a$
and $k_2 > -b$. The sequent $a > 0, b \ge 0, k_1 > a, k_2 > -b \vdash \mathrm{AStab}(\alpha_i)$ is proved
in KeYmaera X using the Lyapunov function $\frac{k_1 - a}{2}\theta^2 + \frac{((b + k_2)\theta + \omega)^2 + \omega^2}{4}$.


6 See https://github.com/LS-Lab/KeYmaeraX-projects/blob/master/stability

Frictional Tennis Racket Theorem. The stability of a 3D rigid body is investigated
for $\alpha_r$ in Examples 17 and 20. The following ODEs model additional
frictional forces that oppose the rotational motion in each axis of the rigid body,
where $\alpha_1, \alpha_2, \alpha_3 > 0$ are positive coefficients of friction:
\[
\alpha_f \equiv x_1' = \frac{I_2 - I_3}{I_1} x_2 x_3 - \alpha_1 x_1,\quad x_2' = \frac{I_3 - I_1}{I_2} x_3 x_1 - \alpha_2 x_2,\quad x_3' = \frac{I_1 - I_2}{I_3} x_1 x_2 - \alpha_3 x_3
\]

In the presence of friction, rotations of the rigid body are globally asymptotically
stable in the first and third principal axes, as proved in KeYmaera X.
\begin{align*}
& \Gamma \equiv I_1 > I_2,\; I_2 > I_3,\; I_3 > 0,\; \alpha_1 > 0,\; \alpha_2 > 0,\; \alpha_3 > 0 \\
& \Gamma \vdash \mathrm{Stab}^P_R(\alpha_f,\; x_2{=}0 \land x_3{=}0,\; x_2{=}0 \land x_3{=}0) \land \mathrm{Attr}^P_R(\alpha_f,\; \mathit{true},\; x_2{=}0 \land x_3{=}0) \\
& \Gamma \vdash \mathrm{Stab}^P_R(\alpha_f,\; x_1{=}0 \land x_2{=}0,\; x_1{=}0 \land x_2{=}0) \land \mathrm{Attr}^P_R(\alpha_f,\; \mathit{true},\; x_1{=}0 \land x_2{=}0)
\end{align*}


Both asymptotic stability properties are proved using $\text{SLyap}^{*}_{\ge}$ and the liveness
property [36] that the kinetic energy $I_1 x_1^2 + I_2 x_2^2 + I_3 x_3^2$ of the system tends
to zero over time. The latter property implies that solutions of $\alpha_f$ exist globally
and that the values of $x_1, x_2, x_3$ asymptotically tend to zero, which proves
global asymptotic stability with the aid of SetSAttr. Even though a proof rule for
(global) asymptotic stability of general nonlinear ODEs and unbounded sets is
not available (Section 4), this example shows that formalized stability properties
can still be proved on a case-by-case basis using dL's ODE reasoning principles.
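The decay of the kinetic energy can be seen from its Lie derivative. The derivation below is a worked sketch under the assumption that the frictional ODE has the form $x_1' = \frac{I_2 - I_3}{I_1} x_2 x_3 - \alpha_1 x_1$, $x_2' = \frac{I_3 - I_1}{I_2} x_3 x_1 - \alpha_2 x_2$, $x_3' = \frac{I_1 - I_2}{I_3} x_1 x_2 - \alpha_3 x_3$; the symbol $E$ for the energy term is introduced here only for illustration.

```latex
\begin{align*}
E &= I_1 x_1^2 + I_2 x_2^2 + I_3 x_3^2, \\
\dot{E} &= 2 I_1 x_1 x_1' + 2 I_2 x_2 x_2' + 2 I_3 x_3 x_3' \\
  &= 2\bigl((I_2 - I_3) + (I_3 - I_1) + (I_1 - I_2)\bigr) x_1 x_2 x_3
     - 2\bigl(\alpha_1 I_1 x_1^2 + \alpha_2 I_2 x_2^2 + \alpha_3 I_3 x_3^2\bigr) \\
  &= -2\bigl(\alpha_1 I_1 x_1^2 + \alpha_2 I_2 x_2^2 + \alpha_3 I_3 x_3^2\bigr)
  \;\le\; -2 \min(\alpha_1, \alpha_2, \alpha_3)\, E .
\end{align*}
```

The cross terms cancel because the three coefficients sum to zero, so only the friction terms remain and $E$ is bounded above by a decaying exponential.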


Moore-Greitzer Jet Engine [12]. The origin of the ODE modeling a simplified
jet engine $\alpha_m \equiv x_1' = -x_2 - \frac{3}{2} x_1^2 - \frac{1}{2} x_1^3,\; x_2' = 3 x_1 - x_2$ is $\varepsilon$-stable for
$\varepsilon = 10^{-10}$ [12]. The sequent $\varepsilon = 10^{-10} \vdash \mathrm{Stab}^P_R(\alpha_m,\; x_1^2 + x_2^2 = 0,\; x_1^2 + x_2^2 < \varepsilon^2)$
is proved in KeYmaera X. The key proof ingredients are an $\varepsilon$-Lyapunov function [12]
and manual arithmetic steps, e.g., instantiating existential quantifiers
appearing in the specification of $\varepsilon$-stability with appropriate values [12].


Other Examples [1]. Stability for several ODEs with Lyapunov functions generated
by an inductive synthesis technique [1, Examples 5-11] was successfully
verified in KeYmaera X. The proof for the largest, 6-dim. nonlinear ODE [1,
Example 5] required substantial manual arithmetic reasoning in KeYmaera X.⁷

The arithmetical conditions in [1, Equation 1] are identical to the premises
of rule $\text{Lyap}_{>}$ except that [1, Equation 1] unsoundly omits the condition $v(0) = 0$, see
supplement [35]. The generated Lyapunov functions remain correct because the
inductive synthesis technique [1] implicitly guarantees this omitted condition.


Summary. These case studies demonstrate the feasibility of carrying out proofs 
of various (advanced) stability properties within KeYmaera X using this paper’s 
stability specifications. The proofs share similar high-level proof structure, which 
suggests that proof automation could significantly reduce proof effort [10]. Such 
automation should also support user input of key insights for difficult reasoning 
steps, e.g., real arithmetic reasoning with nested, alternating quantifiers. 


7 The Lyapunov function in [1, Example 5] does not work for its associated ODE. It
works if the ODE is corrected with $x_1' = -x_1^3 + 4x_2^3 - 6x_3 x_4$, as in the literature [23].

6 Related Work


Stability is a fundamental property of interest across many different fields of 
mathematics [6,15,19,30,31,34] and engineering [14,18,20]. This related work dis- 
cussion focuses on formal approaches to stability of ODEs. 


Logical specification of stability. Rouche, Habets, and Laloy [31] provide a pioneering
example of using logical notation to specify and classify stability properties
of ODEs. Alternative logical frameworks have also been used to specify
stability and related properties: stability is expressed in HyperSTL [22] as a
hyperproperty relating the trace of an ODE against two constant traces; $\varepsilon$-stability
is studied in the context of $\delta$-complete reasoning over the reals [12]; region stability
for hybrid systems [29] is discussed using CTL*; the syntactic specification
of $\mathrm{Asym}(x' = f(x), P)$ resembles the limit definition using filters [16]. This paper
uses dL as a sweet spot logical framework, general enough to specify various
stability properties of interest, e.g., asymptotic or exponential stability, and the
stability of sets, while also enabling syntactic proofs of those properties.


Formal verification of stability. There is a vast literature on finding Lyapunov 
functions for stability, e.g., through numerical [24,23,37] and algebraic meth- 
ods [9,21]. Formal approaches are often based on finding Lyapunov function can- 
didates and certifying the correctness of those generated candidates [1,12,17,33]. 
This paper’s approach enables highly trustworthy certification of those candi- 
dates in dL and KeYmaera X, with stability proof rules that are soundly de- 
rived from dL’s parsimonious axiomatization [25,26,27], as implemented in KeY- 
maera X [11,26]. Sections 4 and 5 further show that this paper’s approach sup- 
ports verification of advanced stability properties [12,14,18] within the same dL 
framework. New stability proof rules like GLyap can also be soundly and syntac- 
tically justified in dL without the need for (low-level) semantic reasoning about 
the underlying ODE mathematics. As an example of the latter, semantic ap- 
proach, LaSalle’s invariance principle is formalized in Coq [7] and used to verify 
the correctness of an inverted pendulum controller [32]. 


7 Conclusion 


This paper shows how ODE stability can be formalized in dL using the key idea
that stability properties are ∀/∃-quantified dynamical formulas. These
specifications, their proof rules, and their logical relationships are all syntactically
derived from dL’s sound proof calculus. This further enables trustworthy KeY- 
maera X proofs that rigorously verify every step in an ODE stability argument, 
from arithmetical premises down to dynamical reasoning for ODEs. Directions 
for future work include i) formalization of stability with respect to perturbations 
of the system dynamics, and ii) generalizations of stability to hybrid systems. 


Acknowledgments. We thank Brandon Bohrer, Stefan Mitsch, and the anonymous
reviewers for their helpful feedback on KeYmaera X and this paper.

Abstract. Deep learning has emerged as an effective approach for creating
modern software systems, with neural networks often surpassing
hand-crafted systems. Unfortunately, neural networks are known to suffer 
from various safety and security issues. Formal verification is a promising 
avenue for tackling this difficulty, by formally certifying that networks 
are correct. We propose an SMT-based technique for verifying binarized 
neural networks — a popular kind of neural network, where some weights 
have been binarized in order to render the neural network more memory 
and energy efficient, and quicker to evaluate. One novelty of our tech- 
nique is that it allows the verification of neural networks that include 
both binarized and non-binarized components. Neural network verifica- 
tion is computationally very difficult, and so we propose here various 
optimizations, integrated into our SMT procedure as deduction steps, as 
well as an approach for parallelizing verification queries. We implement 
our technique as an extension to the Marabou framework, and use it to 
evaluate the approach on popular binarized neural network architectures. 


1 Introduction 


In recent years, deep neural networks (DNNs) have revolutionized the state
of the art in a variety of tasks, such as image recognition [12,37], text
classification, and many others. These DNNs, which are artifacts that are generated
automatically from a set of training data, generalize very well, i.e., are very
successful at handling inputs they had not encountered previously. The success
of DNNs is so significant that they are increasingly being incorporated into
highly-critical systems, such as autonomous vehicles and aircraft [7,30].

In order to tackle increasingly complex tasks, the size of modern DNNs has 
also been increasing, sometimes reaching many millions of neurons [46]. Con- 
sequently, in some domains, DNN size has become a restricting factor: huge 
networks have a large memory footprint, and evaluating them consumes both 
time and energy. Thus, resource-efficient networks are required in order to allow 
DNNs to be deployed on resource-limited, embedded devices [23,42].

One promising approach for mitigating this problem is via DNN quantization.
Ordinarily, each edge in a DNN has an associated weight, typically
stored as a 32-bit floating point number. In a quantized network, these weights
are stored using fewer bits. Additionally, the activation functions used by the 
network are also quantized, so that their outputs consist of fewer bits. The net- 
work’s memory footprint thus becomes significantly smaller, and its evaluation 
much quicker and cheaper. When the weights and activation function outputs 
are represented using just a single bit, the resulting network is called a binarized 
neural network (BNN) [26]. BNNs are a highly popular variant of a quantized 
DNN [10,40,56,57], as their computing time can be up to 58 times faster, and
their memory footprint 32 times smaller, than that of traditional DNNs [45]. 
There are also network architectures in which some parts of the network are 
quantized, and others are not [45]. While quantization leads to some loss of 
network precision, quantized networks are sufficiently precise in many cases [45]. 

In recent years, various security and safety issues have been observed in
DNNs [33,48]. This has led to the development of a large variety of verification
tools and approaches (e.g., [16,25,33,52], and many others). However, most of
these approaches have not focused on binarized neural networks, although they
are just as vulnerable to safety and security concerns as other DNNs. Recent work
has shown that verifying quantized neural networks is PSPACE-hard [24], and
that it requires different methods than the ones used for verifying non-quantized
DNNs [18]. The few existing approaches that do handle binarized networks focus
on the strictly binarized case, i.e., on networks where all components are binary,
and verify them using a SAT solver encoding [29,43]. Neural networks that are
only partially binarized cannot be readily encoded as SAT formulas, and
thus verifying these networks remains an open problem.

Here, we propose an SMT-based approach and tool for the formal verification
of binarized neural networks. We build on top of the Reluplex algorithm [33]
and extend it so that it can support the sign function,

$$\text{sign}(x) = \begin{cases} -1 & x < 0 \\ 1 & x \geq 0 \end{cases}$$


We show how this extension, when integrated into Reluplex, is sufficient for ver- 
ifying BNNs. To the best of our knowledge, the approach presented here is the 
first capable of verifying BNNs that are not strictly binarized. Our technique 
is implemented as an extension to the open-source Marabou framework [34].
We discuss the principles of our approach and the key components of our imple- 
mentation. We evaluate it both on the XNOR-Net BNN architecture [45], which 
combines binarized and non-binarized parts, and on a strictly binarized network. 

The rest of this paper is organized as follows. In Section 2 we provide the
necessary background on DNNs, BNNs, and the SMT-based formal verification
of DNNs. Next, we present our SMT-based approach for supporting the sign
activation function in Section 3, followed by details on enhancements and
optimizations for the approach in Section 4. We discuss the implementation of our
tool in Section 5 and its evaluation in Section 6. Related work is discussed in
Section 7, and we conclude in Section 8.


3 is a recent extended version of the original Reluplex paper [31]. 

2 Background 


Deep Neural Networks. A deep neural network (DNN) is a directed graph, 
where the nodes (also called neurons) are organized in layers. The first layer is 
the input layer, the last layer is the output layer, and the intermediate layers 
are the hidden layers. When the network is evaluated, the input neurons are 
assigned initial values (e.g., the pixels of an image), and these values are then 
propagated through the network, layer by layer, all the way to the output layer. 
The values of the output neurons determine the result returned to the user: 
often, the neuron with the greatest value corresponds to the output class that 
is returned. A network is called feed-forward if outgoing edges from neurons in 
layer i can only lead to neurons in layer j if j > i. For simplicity, we will assume 
here that outgoing edges from layer i only lead to the consecutive layer, i + 1. 

Each layer in the neural network has a layer type, which determines how the 
values of its neurons are computed (using the values of the preceding layer’s 
neurons). One common type is the weighted sum layer: neurons in this layer are 
computed as a linear combination of the values of neurons from the preceding 
layer, according to predetermined edge weights and biases. Another common 
type of layer is the rectified linear unit (ReLU) layer, where each node y is 
connected to precisely one node x from the preceding layer, and its value is 
computed by $y = \text{ReLU}(x) = \max(0, x)$. The max-pooling layer is also common:
each neuron $y$ in this layer is connected to multiple neurons $x_1, \ldots, x_k$ from the
preceding layer, and its value is given by $y = \max(x_1, \ldots, x_k)$.

More formally, a DNN $N$ with $k$ inputs and $m$ outputs is a mapping
$\mathbb{R}^k \to \mathbb{R}^m$. It is given as a sequence of layers $L_1, \ldots, L_n$,
where $L_1$ and $L_n$ are the input and output layers, respectively. We denote the
size of layer $L_i$ as $s_i$, and its individual neurons as $v_i^1, \ldots, v_i^{s_i}$.
We use $V_i$ to denote the column vector $[v_i^1, \ldots, v_i^{s_i}]^T$. During
evaluation, the input values $V_1$ are given, and $V_2, \ldots, V_n$ are computed
iteratively. The network also includes a mapping $T_N : \mathbb{N} \to \mathcal{T}$,
such that $T_N(i)$ indicates the type of hidden layer $i$. For our purposes, we focus
on layer types $\mathcal{T} = \{\text{weighted sum}, \text{ReLU}, \text{max}\}$, but of
course other types could be included. If $T_N(i) = \text{weighted sum}$, then layer
$L_i$ has a weight matrix $W_i$ of dimensions $s_i \times s_{i-1}$ and a bias vector
$B_i$ of size $s_i$, and its values are computed as $V_i = W_i \cdot V_{i-1} + B_i$.
For $T_N(i) = \text{ReLU}$, the ReLU function is applied to each neuron, i.e.
$v_i^j = \text{ReLU}(v_{i-1}^j)$ (we require that $s_i = s_{i-1}$ in this case). If
$T_N(i) = \text{max}$, then each neuron $v_i^j$ in layer $L_i$ has a list src of source
indices, and its value is computed as $v_i^j = \max_{k \in \text{src}} v_{i-1}^k$.


A simple illustration appears in Fig. 1. This network has a weighted sum layer
and a ReLU layer as its hidden layers, and a weighted sum layer as its output
layer. For the weighted sum layers, the weights and biases are listed in the
figure. On input $V_1 = [1, 2]^T$, the first layer's neurons evaluate to
$V_2 = [6, -1]^T$. After ReLUs are applied, we get $V_3 = [6, 0]^T$, and finally
the output is $V_4 = [6]$.

Fig. 1: A toy DNN.


Binarized Neural Networks. In a binarized neural network (BNN), the layers
are typically organized into binary blocks, regarded as units with binary inputs
and outputs. Following the definitions of Hubara et al. and Narodytska et al. [43], a
binary block is comprised of three layers: (i) a weighted sum layer, where each
entry of the weight matrix $W$ is either 1 or −1; (ii) a batch normalization layer,
which normalizes the values from its preceding layer (this layer can be regarded
as a weighted sum layer, where the weight matrix $W$ has real-valued entries in
its diagonal, and 0 for all other entries); and (iii) a sign layer, which applies the
sign function to each neuron in the preceding layer. Because each block ends
with a sign layer, its output is always a binary vector, i.e. a vector whose entries
are ±1. Thus, when several binary blocks are concatenated, the inputs and
outputs of each block are always binary. Here, we call a network strictly binarized
if it is composed solely of binary blocks (except for the output layer). If the
network contains binary blocks but also additional layers (e.g., ReLU layers), we
say that it is a partially binarized neural network. BNNs can be made to fit into
our definitions by extending the set $\mathcal{T}$ to include the sign function. An
example appears in Fig. 2: for input $V_1 = [-1, 3]^T$, the network's output is
$V_5 = [-2]$.

Fig. 2: A toy BNN with a single binary block composed of three layers: a weighted
sum layer, a batch normalization layer, and a sign layer.
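As a minimal sketch of how such a binary block is evaluated, the Python snippet below propagates an input through the toy BNN of Fig. 2. The weights are read off the constraints listed for this network in the worked example of Section 3 (a weighted sum with weights 1 and −1 and bias 1, a batch normalization step that multiplies by 0.5, a sign layer, and an output weight of 2); the function names are illustrative only and are not part of any tool.

```python
def sign(x):
    """Sign activation as defined in the text: -1 for negative inputs, 1 otherwise."""
    return -1.0 if x < 0 else 1.0


def weighted_sum(weights, biases, values):
    """Compute W * values + B, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def evaluate_toy_bnn(v1):
    """Evaluate the single-block toy BNN layer by layer (hypothetical helper)."""
    v2 = weighted_sum([[1.0, -1.0]], [1.0], v1)  # binary weights, bias 1
    v3 = weighted_sum([[0.5]], [0.0], v2)        # batch normalization as a diagonal weighted sum
    v4 = [sign(x) for x in v3]                   # sign layer: output is always +1 or -1
    v5 = weighted_sum([[2.0]], [0.0], v4)        # output layer
    return v5


print(evaluate_toy_bnn([-1.0, 3.0]))  # [-2.0]
```

The intermediate values for this input are −3, −1.5 and −1, which is why the output is −2.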


SMT-Based Verification of Deep Neural Networks. Given a DNN $N$ that
transforms an input vector $x$ into an output vector $y = N(x)$, a pre-condition
$P$ on $x$, and a post-condition $Q$ on $y$, the DNN verification problem is to
determine whether there exists a concrete input $x_0$ such that
$P(x_0) \wedge Q(N(x_0))$. Typically, $Q$ represents an undesirable output of the
DNN, and so the existence of such an $x_0$ constitutes a counterexample. A sound
and complete verification engine should return a suitable $x_0$ if the problem is
satisfiable (SAT), or reply that it is unsatisfiable (UNSAT). As in most DNN
verification literature, we will restrict ourselves to the case where $P$ and $Q$
are conjunctions of linear constraints over the input and output neurons,
respectively [16,33,52].

Here, we focus on an SMT-based approach for DNN verification, which was
introduced in the Reluplex algorithm and extended in the Marabou framework [34].
It entails regarding the DNN's node values as variables, and the
verification query as a set of constraints on these variables. The solver's goal
is to find an assignment of the DNN's nodes that satisfies $P$ and $Q$. The
constraints are partitioned into two sets: linear constraints, i.e. equations and
variable lower and upper bounds, which include the input constraints in $P$, the
output constraints in $Q$, and the weighted sum layers within the network; and
piecewise-linear constraints, which include the activation function constraints,
such as ReLU or max constraints. The linear constraints are easier to solve
(specifically, they can be phrased as a linear program [6], solvable in polynomial
time); whereas the piecewise-linear constraints are more difficult, and render the
problem NP-complete [33]. We observe that sign constraints are also
piecewise-linear.


In Reluplex, the linear constraints are solved iteratively, using a variant of the 
Simplex algorithm [13]. Specifically, Reluplex maintains a variable assignment,
and iteratively corrects the assignments of variables that violate a linear con- 
straint. Once the linear constraints are satisfied, Reluplex attempts to correct any 
violated piecewise-linear constraints — again by making iterative adjustments 
to the assignment. If these steps re-introduce violations in the linear constraints, 
these constraints are addressed again. Often, this process converges; but if it 
does not, Reluplex performs a case split, which transforms one piecewise-linear 
constraint into a disjunction of linear constraints. Then, one of the disjuncts 
is applied and the others are stored, and the solving process continues; and if 
UNSAT is reached, Reluplex backtracks, removes the disjunct it has applied and 
applies a different disjunct instead. The process terminates either when one of 
the search paths returns SAT (the entire query is SAT), or when they all return 
UNSAT (the entire query is UNSAT). It is desirable to perform as few case splits as 
possible, as they significantly enlarge the search space to be explored. 


The Reluplex algorithm is formally defined as a sound and complete calculus
of derivation rules [33]. We omit here the derivation rules aimed at solving the
linear constraints, and present only the rules aimed at addressing the
piecewise-linear constraints; specifically, ReLU constraints [33]. These derivation
rules are given in Fig. 3, where: (i) $X$ is the set of all variables in the query;
(ii) $R$ is the set of all ReLU pairs; i.e., $(b, f) \in R$ implies that it should
hold that $f = \text{ReLU}(b)$; (iii) $a$ is the current assignment, mapping
variables to real values; (iv) $l$ and $u$ map variables to their current lower and
upper bounds, respectively; and (v) the update$(a, x, v)$ procedure changes the
current assignment $a$ by setting the value of $x$ to $v$. The ReluCorrect$_b$ and
ReluCorrect$_f$ rules are used for correcting an assignment in which a ReLU
constraint is currently violated, by adjusting either the value of $b$ or $f$,
respectively. The ReluSplit rule transforms a ReLU constraint into a disjunction,
by forcing either $b$'s lower bound to be non-negative, or its upper bound to be
non-positive. This forces the constraint into either its active phase (the identity
function) or its inactive phase (the zero function). In the case when we guess
that a ReLU is active, we also apply the addEq operation to add the equation
$f = b$, in order to make sure the ReLU is satisfied in the active phase. The
Success rule terminates the search procedure when all variable assignments are
within their bounds (i.e., all linear constraints hold), and all ReLU constraints
are satisfied. The rule for reaching an UNSAT conclusion is part of the linear
constraint derivation rules which are not depicted; see [33] for additional details.


$$\textsf{ReluCorrect}_b\;\; \frac{(b, f) \in R,\;\; a(f) \neq \text{ReLU}(a(b))}{a := \text{update}(a, b, a(f))}
\qquad
\textsf{ReluCorrect}_f\;\; \frac{(b, f) \in R,\;\; a(f) \neq \text{ReLU}(a(b))}{a := \text{update}(a, f, \text{ReLU}(a(b)))}$$

$$\textsf{ReluSplit}\;\; \frac{(b, f) \in R}{\begin{array}{l|l}
u(b) := \min(u(b), 0), & l(b) := \max(l(b), 0), \\
l(f) := \max(l(f), 0), & \text{addEq}(f = b) \\
u(f) := \min(u(f), 0) &
\end{array}}$$

$$\textsf{Success}\;\; \frac{\forall x \in X.\; l(x) \leq a(x) \leq u(x),\;\; \forall (b, f) \in R.\; a(f) = \text{ReLU}(a(b))}{\text{SAT}}$$

Fig. 3: Derivation rules for the Reluplex algorithm (simplified; see [33] for more
details).

The aforementioned derivation rules describe a search procedure: the solver
incrementally constructs a satisfying assignment, and performs case splitting
when needed. Another key ingredient in modern SMT solvers is deduction steps,
aimed at narrowing down the search space by ruling out possible case splits.
In this context, deductions are aimed at obtaining tighter bounds for variables:
i.e., finding greater values for $l(x)$ and smaller values for $u(x)$ for each
variable $x \in X$. These bounds can indeed remove case splits by fixing activation
functions into one of their phases; for example, if $f = \text{ReLU}(b)$ and we
deduce that $b \geq 3$, we know that the ReLU is in its active phase, and no case
split is required. We provide additional details on some of these deduction steps
in Section 4.


3 Extending Reluplex to Support Sign Constraints 


In order to extend Reluplex to support sign constraints, we follow a similar
approach to how ReLUs are handled. We encode every sign constraint
$f = \text{sign}(b)$ as two separate variables, $f$ and $b$. Variable $b$ represents
the input to the sign function, whereas $f$ represents the sign's output. In the
toy example from Fig. 2, $b$ will represent the assignment for neuron $v_3^1$, and
$f$ will represent $v_4^1$.

Initially, a sign constraint poses no bound constraints over $b$, i.e.
$l(b) = -\infty$ and $u(b) = \infty$. Because the values of $f$ are always ±1, we set
$l(f) = -1$ and $u(f) = 1$. If, during the search and deduction process, tighter
bounds are discovered that imply that $b \geq 0$ or $f > -1$, we say that the sign
constraint has been fixed to the positive phase; in this case, it can be regarded as
a linear constraint, namely $b \geq 0 \wedge f = 1$. Likewise, if it is discovered
that $b < 0$ or $f < 1$, the constraint is fixed to the negative phase, and is
regarded as $b < 0 \wedge f = -1$.
If neither case applies, we say that the constraint's phase has not yet been fixed.
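The phase test described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not Marabou's implementation; the function name and argument order are hypothetical, and the test for the positive phase is read as a non-strict lower bound on b (since sign(0) = 1).

```python
import math


def sign_phase(lb_b, ub_b, lb_f, ub_f):
    """Classify f = sign(b) from the current bounds on b and f (illustrative helper)."""
    if lb_b >= 0 or lb_f > -1:
        return "positive"  # constraint reduces to b >= 0 and f = 1
    if ub_b < 0 or ub_f < 1:
        return "negative"  # constraint reduces to b < 0 and f = -1
    return "unfixed"       # a case split may still be needed


print(sign_phase(-math.inf, math.inf, -1, 1))  # unfixed (initial bounds)
print(sign_phase(0.5, 4.0, -1, 1))             # positive
print(sign_phase(-3.0, 2.0, -1, 0))            # negative
```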

In each iteration of the search procedure, a violated constraint is selected 
and corrected, by altering the variable assignment. A violated sign constraint is 
corrected by assigning f the appropriate value: —1 if the current assignment of b 
is negative, and 1 otherwise. Case splits (which are needed to ensure completeness 
and termination) are handled similarly to the ReLU case: we allow the solver to 
assert that a sign constraint is in either the positive or negative phase, and then 
backtrack and flip that assertion if the search hits a dead-end. 

More formally, we define this extension to Reluplex by modifying the derivation
rules described in Fig. 3, as follows. The rules for handling linear constraints
and ReLU constraints are unchanged; the approach is modular and extensible in that
sense, as each type of constraint is addressed separately. In Fig. 4 we depict new
derivation rules, capable of addressing sign constraints. Here, $S$ denotes the set
of all sign pairs. The SignCorrect$_-$ and SignCorrect$_+$ rules allow us to adjust
the assignment of $f$ to account for the current assignment of $b$, i.e., set $f$
to −1 if $b$ is negative, and to 1 otherwise. The SignSplit rule is used for
performing a case split on a sign constraint, introducing a disjunction for
enforcing that either $b$ is non-negative ($l(b) \geq 0$) and $f = 1$, or $b$ is
negative ($u(b) \leq -\epsilon$; $\epsilon$ is a small positive constant, chosen to
reflect the desired precision) and $f = -1$. Finally, the Success rule replaces the
one from Fig. 3: it requires that all linear, ReLU and sign constraints be
satisfied simultaneously.

$$\textsf{SignCorrect}_-\;\; \frac{(b, f) \in S,\;\; a(b) < 0,\;\; a(f) \neq -1}{a := \text{update}(a, f, -1)}
\qquad
\textsf{SignCorrect}_+\;\; \frac{(b, f) \in S,\;\; a(b) \geq 0,\;\; a(f) \neq 1}{a := \text{update}(a, f, 1)}$$

$$\textsf{SignSplit}\;\; \frac{(b, f) \in S}{\begin{array}{l|l}
u(b) := \min(u(b), -\epsilon), & l(b) := \max(l(b), 0), \\
l(f) := \max(l(f), -1), & l(f) := \max(l(f), 1), \\
u(f) := \min(u(f), -1) & u(f) := \min(u(f), 1)
\end{array}}$$

$$\textsf{Success}\;\; \frac{\forall x \in X.\; l(x) \leq a(x) \leq u(x),\;\; \forall (b, f) \in S.\; a(f) = \text{sign}(a(b)),\;\; \forall (b, f) \in R.\; a(f) = \text{ReLU}(a(b))}{\text{SAT}}$$

Fig. 4: The extended Reluplex derivation rules, with support for sign constraints.

We demonstrate this process with a simple example. Observe again the toy
example from Fig. 2, the pre-condition
$P = (1 \leq v_1^1 \leq 2) \wedge (-1 \leq v_1^2 \leq 1)$, and the post-condition
$Q = (v_5^1 \leq 5)$. Our goal is to find an assignment to the variables
$\{v_1^1, v_1^2, v_2^1, v_3^1, v_4^1, v_5^1\}$ that satisfies $P$, $Q$, and also the
constraints imposed by the BNN itself, namely the weighted sums
$v_2^1 = v_1^1 - v_1^2 + 1$, $v_3^1 = 0.5 v_2^1$, and $v_5^1 = 2 v_4^1$, and the
sign constraint $v_4^1 = \text{sign}(v_3^1)$.

Initially, we invoke derivation rules that address the linear constraints
(see [33]), and come up with an assignment that satisfies them, depicted as
assignment 1 in Fig. 5. However, this assignment violates the sign constraint:
$v_4^1 = -1 \neq \text{sign}(v_3^1) = \text{sign}(1) = 1$. We can thus invoke the
SignCorrect$_+$ rule, which adjusts the assignment, leading to assignment 2
in the figure. The sign constraint is now satisfied, but the linear constraint
$v_5^1 = 2 v_4^1$ is violated. We thus let the solver correct the linear constraints
again, this time obtaining assignment 3 in the figure, which satisfies all
constraints. The Success rule now applies, and we return SAT and the satisfying
variable assignment.

variable        v_1^1   v_1^2   v_2^1   v_3^1   v_4^1   v_5^1
assignment 1      1       0       2       1      -1      -2
assignment 2      1       0       2       1       1      -2
assignment 3      1       0       2       1       1       2

Fig. 5: An iterative solution for a BNN verification query.
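The three assignments of this example can be replayed with a short Python sketch. It is a simplification under stated assumptions: the variable names (`v11`, `v12`, ...) are hypothetical stand-ins for the neurons, the bounds in P and Q are read as non-strict, and the step that repairs the linear constraints simply recomputes the output neuron rather than running the simplex-based procedure used by the solver.

```python
def sign(x):
    return -1 if x < 0 else 1


def sign_correct(assignment, b, f):
    """Mimic SignCorrect_- / SignCorrect_+: set f to the sign of b's current value."""
    updated = dict(assignment)
    updated[f] = sign(assignment[b])
    return updated


def fix_output_equation(assignment):
    """Stand-in for the linear-constraint repair: restore v5 = 2 * v4."""
    updated = dict(assignment)
    updated["v5"] = 2 * updated["v4"]
    return updated


def all_constraints_hold(a):
    """Check P, Q, the weighted sums, and the sign constraint of the toy BNN."""
    pre = 1 <= a["v11"] <= 2 and -1 <= a["v12"] <= 1
    post = a["v5"] <= 5
    linear = (a["v2"] == a["v11"] - a["v12"] + 1
              and a["v3"] == 0.5 * a["v2"]
              and a["v5"] == 2 * a["v4"])
    return pre and post and linear and a["v4"] == sign(a["v3"])


assignment1 = {"v11": 1, "v12": 0, "v2": 2, "v3": 1, "v4": -1, "v5": -2}
assignment2 = sign_correct(assignment1, "v3", "v4")  # sign constraint repaired
assignment3 = fix_output_equation(assignment2)       # linear constraint repaired

print(all_constraints_hold(assignment1))  # False: sign constraint violated
print(all_constraints_hold(assignment2))  # False: v5 = 2 * v4 violated
print(all_constraints_hold(assignment3))  # True: Success applies
```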

The above-described calculus is sound and complete (assuming the $\epsilon$ used
in the SignSplit rule is sufficiently small): when it answers SAT or UNSAT, that
statement is correct, and for any input query there is a sequence of derivation
steps that will lead to either SAT or UNSAT. The proof is quite similar to that of the
original Reluplex procedure [33], and is omitted. A naive strategy that will always
lead to termination is to apply the SignSplit rule to saturation; this effectively 
transforms the problem into an (exponentially long) sequence of linear programs. 
Then, each of these linear programs can be solved quickly (linear programming 
is known to be in P). However, this strategy is typically quite slow. In the next 
section we discuss how many of these case splits can be avoided by applying 
multiple optimizations. 


4 Optimizations 


Weighted Sum Layer Elimination. The SMT-based approach introduces 
a new variable for each node in a weighted sum layer, and an equation to ex- 
press that node’s value as a weighted sum of nodes from the preceding layer. In 
BNNs, we often encounter consecutive weighted sum layers — specifically be- 
cause of the binary block structure, in which a weighted sum layer is followed by 
a batch normalization layer, which is also encoded as a weighted sum layer. Thus,
a straightforward way to reduce the number of variables and equations, and
hence to expedite the solution process, is to combine two consecutive weighted
sum layers into a single layer. Specifically, the original layers can be regarded as
transforming input $x$ into $y = W_2 (W_1 \cdot x + B_1) + B_2$, and the
simplification as computing $y = W_3 \cdot x + B_3$, where $W_3 = W_2 \cdot W_1$ and
$B_3 = W_2 \cdot B_1 + B_2$. An illustration appears in Fig. 6 (for simplicity, all
bias values are assumed to be 0).
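The merge amounts to one matrix product and one matrix-vector product. The sketch below, in plain Python with hypothetical helper names, applies it to the weighted sum and batch normalization layers of the toy BNN from Fig. 2 (weights 1 and −1 with bias 1, followed by a multiplication by 0.5).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def merge_weighted_sum_layers(w1, b1, w2, b2):
    """Return (W3, B3) with W3 = W2 * W1 and B3 = W2 * B1 + B2."""
    w3 = matmul(w2, w1)
    b3 = [wb + b for wb, b in zip(matvec(w2, b1), b2)]
    return w3, b3


w3, b3 = merge_weighted_sum_layers([[1.0, -1.0]], [1.0], [[0.5]], [0.0])
print(w3, b3)  # [[0.5, -0.5]] [0.5]

# The merged layer agrees with applying the two layers in sequence.
x = [-1.0, 3.0]
print(matvec(w3, x)[0] + b3[0])  # -1.5
```

The merged encoding needs one variable and one equation per output neuron instead of two.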


Fig. 6: On the left, a (partial) DNN with two consecutive weighted sum layers.
On the right, an equivalent DNN with these two layers merged into one.


LP Relaxation. Given a constraint $f = \text{sign}(b)$, it is beneficial to deduce
tighter bounds on the $b$ and $f$ variables, especially if these tighter bounds fix
the constraint into one of its linear phases. We thus introduce a preprocessing
phase, prior to the invocation of our enhanced Reluplex procedure, in which
tighter bounds are computed by invoking a linear programming (LP) solver.
The idea, inspired by similar relaxations for ReLU nodes, is to over-approximate
each constraint in the network, including sign constraints, as a set
of linear constraints. Then, for every variable $v$ in the encoding, an LP solver
is used to compute an upper bound $u$ (by maximizing) and a lower bound $l$
(by minimizing) for $v$. Because the LP encoding is an over-approximation, $v$ is
indeed within the range $[l, u]$ for any input to the network.

Let $f = \text{sign}(b)$, and suppose we initially know that $l \leq b \leq u$,
with $l < 0 < u$. The linear over-approximation that we introduce for $f$ is a
trapezoid (see Fig. 7), with the following edges: (i) $f \leq 1$; (ii) $f \geq -1$;
(iii) $f \leq -\frac{2}{l} \cdot b + 1$; and (iv) $f \geq \frac{2}{u} \cdot b - 1$.
It is straightforward to show that these four inequalities form the smallest convex
polytope containing the values of $f$.

We demonstrate this process on the simple BNN depicted on the left-hand
side of Fig. 7. Suppose we know that the input variable, $x$, is bounded in the
range $-1 \leq x \leq 1$, and we wish to compute a lower bound for $y$. Simple,
interval-arithmetic based bound propagation shows that $b_1 = 3x + 1$ is bounded in
the range $-2 \leq b_1 \leq 4$, and similarly that $b_2 = -4x + 2$ is in the range
$-2 \leq b_2 \leq 6$. Because neither $b_1$ nor $b_2$ is strictly negative or
positive, we only know that $-1 \leq f_1, f_2 \leq 1$, and so the best bound
obtainable for $y$ is $y \geq -2$. However, by formulating the LP relaxation of the
problem (right-hand side of Fig. 7), we get the optimal solution
$x = -\frac{1}{3}$, $b_1 = 0$, $b_2 = \frac{10}{3}$, $f_1 = -1$, $f_2 = \frac{1}{9}$,
implying the tighter bound $y \geq -\frac{8}{9}$.
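The bound in this example can be re-derived exactly with a few lines of Python. This is a sketch under two assumptions: the output is taken to be y = f1 + f2 (consistent with the interval bound of −2 quoted above), and instead of calling an LP solver the code exploits the structure of this particular LP. For a fixed x the smallest feasible f_i is max(−1, (2/u_i)·b_i − 1), so the objective is a convex piecewise-linear function of x whose minimum lies at an endpoint of the input range or at a point where some b_i equals 0.

```python
from fractions import Fraction as F

X_LO, X_HI = F(-1), F(1)


def b1(x):
    return 3 * x + 1


def b2(x):
    return -4 * x + 2


def interval(slope, offset, lo, hi):
    """Range of slope * x + offset over [lo, hi]."""
    ends = (slope * lo + offset, slope * hi + offset)
    return min(ends), max(ends)


def relaxed_lower(b, u):
    """Smallest f allowed by the trapezoid: f >= -1 and f >= (2/u) * b - 1."""
    return max(F(-1), F(2) / u * b - 1)


l1, u1 = interval(3, 1, X_LO, X_HI)   # bounds on b1: (-2, 4)
l2, u2 = interval(-4, 2, X_LO, X_HI)  # bounds on b2: (-2, 6)


def relaxed_min_y(x):
    """Minimum of y = f1 + f2 in the relaxation, for a fixed input x."""
    return relaxed_lower(b1(x), u1) + relaxed_lower(b2(x), u2)


# Endpoints of the input range plus the roots of b1 and b2.
candidates = [X_LO, F(-1, 3), F(1, 2), X_HI]
best_x = min(candidates, key=relaxed_min_y)
print(best_x, relaxed_min_y(best_x))  # -1/3 -8/9
```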


minimize $y$ s.t.:
    $-1 \leq x \leq 1$
    $b_1 = 3x + 1$
    $b_2 = -4x + 2$
    $y = f_1 + f_2$
    $-1 \leq f_1 \leq 1$
    $-1 \leq f_2 \leq 1$
    $\frac{1}{2} b_1 - 1 \leq f_1 \leq b_1 + 1$
    $\frac{1}{3} b_2 - 1 \leq f_2 \leq b_2 + 1$

Fig. 7: A simple BNN (left), the trapezoid relaxation of $f_1 = \text{sign}(b_1)$
(center), and its LP encoding (right). The trapezoid relaxation of $f_2$ is not
depicted.


The aforementioned linear relaxation technique is effective but expensive 
— because it entails invoking the LP solver twice for each neuron in the BNN 
encoding. Consequently, in our tool, the technique is applied only once per query, 
as a preprocessing step. Later, during the search procedure, we apply a related 
but more lightweight technique, called symbolic bound tightening [52], which we
enhanced to support sign constraints. 


Symbolic Bound Tightening. In symbolic bound tightening, we compute
for each neuron $v$ a symbolic lower bound $sl(v)$ and a symbolic upper bound
$su(v)$, which are linear combinations of the input neurons. Upper and lower
bounds can then be derived from their symbolic counterparts using simple
interval arithmetic. For example, suppose the network's input nodes are $x_1$ and
$x_2$, and that for some neuron $v$ we have:

$$sl(v) = 5x_1 - 2x_2 + 3, \qquad su(v) = 3x_1 + 4x_2 - 1$$

and that the currently known bounds are $x_1 \in [-1, 2]$, $x_2 \in [-1, 1]$ and
$v \in [-2, 11]$. Using the symbolic bounds and the input bounds, we can derive that
the upper bound of $v$ is at most $6 + 4 - 1 = 9$, and that its lower bound is at
least $-5 - 2 + 3 = -4$. In this case, the upper bound we have discovered for $v$ is
tighter than the previous one, and so we can update $v$'s range to be $[-2, 9]$.
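The interval-arithmetic step of this example can be sketched as follows. Symbolic bounds are represented here as a coefficient list plus a constant; this representation and the function name are choices made for illustration only.

```python
def concretize(coeffs, const, input_bounds):
    """Return (min, max) of a linear expression over a box of input ranges."""
    lo = hi = const
    for c, (l, u) in zip(coeffs, input_bounds):
        # A positive coefficient is minimized at the lower input bound,
        # a negative coefficient at the upper input bound.
        lo += c * l if c >= 0 else c * u
        hi += c * u if c >= 0 else c * l
    return lo, hi


input_bounds = [(-1, 2), (-1, 1)]  # ranges of x1 and x2
sl = ([5, -2], 3)                  # sl(v) = 5*x1 - 2*x2 + 3
su = ([3, 4], -1)                  # su(v) = 3*x1 + 4*x2 - 1

lower_from_sl, _ = concretize(*sl, input_bounds)
_, upper_from_su = concretize(*su, input_bounds)

old_range = (-2, 11)
new_range = (max(old_range[0], lower_from_sl), min(old_range[1], upper_from_su))
print(lower_from_sl, upper_from_su, new_range)  # -4 9 (-2, 9)
```

Only the upper bound improves here: the derived lower bound of −4 is looser than the known −2, so the known value is kept.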
The symbolic bound expressions are propagated layer by layer [52]. Propagation
through weighted sum layers is straightforward: the symbolic bounds are simply
multiplied by the respective edge weights and summed up. Efficient approaches
for propagations through ReLU layers have also been proposed [51]. Our
contribution here is an extension of these techniques for propagating symbolic
bounds also through sign layers. The approach again uses a trapezoid, although a
more coarse one, so that we can approximate each neuron from above and below
using a single linear expression. More specifically, for $f = \text{sign}(b)$ with
$b \in [l, u]$ and previously-computed symbolic bounds $su(b)$ and $sl(b)$, the
symbolic bounds for $f$ are given by:

$$sl(f) = \frac{2}{u} \cdot sl(b) - 1, \qquad su(f) = -\frac{2}{l} \cdot su(b) + 1$$

An illustration appears in Fig. 8. The blue trapezoid is the relaxation we use for
the symbolic bound computation, whereas the gray trapezoid is the one used for
the LP relaxation discussed previously. The blue trapezoid is larger, and hence
leads to looser bounds than the gray trapezoid; but it is computationally cheaper
to compute and use, and our evaluation demonstrates its usefulness.

Fig. 8: Symbolic bounds for $f = \text{sign}(b)$.
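A minimal Python sketch of this propagation step follows. It assumes the slopes 2/u and −2/l, i.e., the slanted edges of the LP-relaxation trapezoid without the horizontal cut-offs, and it represents a symbolic bound as a coefficient list plus a constant. The handling of constraints whose phase is already fixed (returning the constants 1 or −1) is an added assumption, as are all function names.

```python
def scale(expr, factor, shift):
    """Return factor * expr + shift for expr = (coefficients, constant)."""
    coeffs, const = expr
    return ([factor * c for c in coeffs], factor * const + shift)


def sign_symbolic_bounds(sl_b, su_b, l, u):
    """Propagate symbolic bounds of b through f = sign(b), given l <= b <= u."""
    n = len(sl_b[0])
    if l >= 0:   # positive phase: f is the constant 1
        return ([0.0] * n, 1.0), ([0.0] * n, 1.0)
    if u < 0:    # negative phase: f is the constant -1
        return ([0.0] * n, -1.0), ([0.0] * n, -1.0)
    # Phase not fixed (l < 0 <= u): one linear expression from below, one from above.
    return scale(sl_b, 2.0 / u, -1.0), scale(su_b, -2.0 / l, 1.0)


# b = 3x + 1 with x in [-1, 1], so b lies in [-2, 4].
b_expr = ([3.0], 1.0)
sl_f, su_f = sign_symbolic_bounds(b_expr, b_expr, -2.0, 4.0)
print(sl_f)  # ([1.5], -0.5), i.e. sl(f) = 1.5x - 0.5
print(su_f)  # ([3.0], 2.0),  i.e. su(f) = 3x + 2
```

At x = −1 these expressions give −2 ≤ f ≤ −1 and at x = 1 they give 1 ≤ f ≤ 5, so the true values of f (−1 and 1) are enclosed in both cases.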


Polarity-based Splitting. The Marabou framework supports a parallelized
solving mode, using the Split-and-Conquer (S&C) algorithm [54]. At a high level,
S&C partitions a verification query $\varphi$ into a set of sub-queries
$\Phi := \{\varphi_1, \ldots, \varphi_n\}$, such that $\varphi$ and
$\bigvee_{\varphi' \in \Phi} \varphi'$ are equi-satisfiable, and handles each
sub-query independently. Each sub-query is solved with a timeout value; and if that
value is reached, the sub-query is again split into additional sub-queries, and each
is solved with a greater timeout value. The process repeats until one of the
sub-queries is determined to be SAT, or until all sub-queries are proven UNSAT.

One Marabou strategy for creating sub-queries is by splitting the ranges of
input neurons. For example, if in query $\varphi$ an input neuron $x$ is bounded in
the range $x \in [0, 4]$ and $\varphi$ times out, it might be split into $\varphi_1$
and $\varphi_2$ such that $x \in [0, 2]$ in $\varphi_1$ and $x \in [2, 4]$ in
$\varphi_2$. This strategy is effective when the neural network being verified has
only a few input neurons.

Another way to create sub-queries is to perform case-splits on piecewise-linear
constraints (sign constraints, in our case). For instance, given a verification
query $\varphi := \varphi' \wedge f = \text{sign}(b)$, we can partition it into
$\varphi^- := \varphi' \wedge b < 0 \wedge f = -1$ and
$\varphi^+ := \varphi' \wedge b \geq 0 \wedge f = 1$. Note that $\varphi$ and
$\varphi^+ \vee \varphi^-$ are equi-satisfiable.

The heuristics for picking which sign constraint to split on have a significant
impact on the difficulty of the resulting sub-problems [54]. Specifically, it is
desirable that the sub-queries be easier than the original query, and also that
they be balanced in terms of runtime, i.e., we wish to avoid the case where
$\varphi_1$ is very easy and $\varphi_2$ is very hard, as that makes poor use of
parallel computing resources. To create easier sub-problems, we propose to split on
sign constraints that occur in the earlier layers of the BNN, as that leads to
efficient bound propagation when combined with our symbolic bound tightening
mechanism. To create balanced sub-problems, we use a metric called polarity, which
was previously proposed for ReLUs and is extended here to support sign constraints.


Definition 1. Given a sign constraint $f = \mathrm{sign}(b)$, and the bounds $l \le b \le u$,
where $l < 0$ and $u > 0$, the polarity of the sign constraint is defined as
$p = \frac{u + l}{u - l}$.


Intuitively, the closer the polarity is to 0, the more balanced the resulting
queries will be if we perform a case-split on this constraint. For example, if
$\phi = \phi' \land -10 \le b \le 10$ and we create $\phi_1 = \phi' \land -10 \le b < 0$, $\phi_2 = \phi' \land 0 \le b \le 10$,
then queries $\phi_1$ and $\phi_2$ are roughly balanced. However, if initially $-10 \le b \le 1$,
we obtain $\phi_1 = \phi' \land -10 \le b < 0$ and $\phi_2 = \phi' \land 0 \le b \le 1$. In this case, $\phi_2$ might
prove significantly easier than $\phi_1$ because the smaller range of $b$ in $\phi_2$ could lead
to very effective bound tightening. Consequently, we use a heuristic that picks
the sign constraint with the smallest polarity among the first k candidates (in
topological order), where k is a configurable parameter. In our experiments, we
empirically selected k = 5.
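A minimal sketch of this heuristic is given below. The function names and the tuple layout are hypothetical. It also makes one interpretive assumption: "smallest polarity" is read as "smallest in absolute value", in line with the remark that polarities close to 0 give balanced splits.

```python
def polarity(lower, upper):
    """Polarity p = (u + l) / (u - l) of a sign constraint with l < 0 < u."""
    if not (lower < 0 < upper):
        raise ValueError("polarity is only defined when l < 0 and u > 0")
    return (upper + lower) / (upper - lower)


def pick_split(constraints, k=5):
    """Pick a sign constraint to split on.

    constraints: list of (name, lower, upper) in topological order.
    Only constraints whose phase is not yet fixed (l < 0 < u) are
    candidates; among the first k of those, the one whose polarity is
    closest to 0 is returned (assumed reading of "smallest polarity").
    """
    candidates = [c for c in constraints if c[1] < 0 < c[2]][:k]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(polarity(c[1], c[2])))[0]


print(polarity(-10, 10))  # 0.0, a balanced split
print(pick_split([("s1", -10, 1), ("s2", -10, 10), ("s3", -1, 0.5)]))  # s2
```

With bounds [-10, 1] the polarity is -9/11, far from 0, which matches the unbalanced example in the text.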


5 Implementation 


We implemented our approach as an extension to Marabou [84], which is an open-
source, freely available SMT-based DNN verification framework. Marabou
implements the Reluplex algorithm, but with multiple extensions and optimizations
(e.g., support for additional activation functions, deduction methods, and
parallelization [54]). It has been used for a variety of verification tasks, such as
network simplification and optimization [47], verification of video streaming
protocols [35], DNN modification, robustness evaluation,
verification of recurrent networks [28], and others. However, to date Marabou
could not support sign constraints, and thus could not be used to verify BNNs.
Below we describe our main contributions to the code base. Our complete code
is available as an artifact accompanying this paper, and has also been merged
into the main Marabou repository.


Basic Support for Sign Constraints (SignConstraint.cpp). During execution,
Marabou maintains a set of piecewise-linear constraints that are part
of the query being solved. To support various activation functions, these constraints
are represented using classes that inherit from the abstract
PiecewiseLinearConstraint class. Here, we added a new sub-class, SignConstraint, that
inherits from PiecewiseLinearConstraint. The methods of this class check whether
the piecewise-linear sign constraint is satisfied and, if it is not, determine which
possible changes to the current assignment could fix the violation. This class's
methods also extend Marabou's deduction mechanism for bound tightening.
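The three responsibilities named above (satisfaction check, candidate fixes, bound deductions) can be illustrated with a small Python sketch. This is not Marabou's C++ API: the class name, method names, and dictionary-based bounds are hypothetical, the convention sign(0) = 1 is an assumption, and the repair values chosen for b are arbitrary representatives.

```python
def sign(value):
    # Assumed convention: sign(b) = 1 for b >= 0, and -1 otherwise.
    return 1.0 if value >= 0 else -1.0


class SignConstraintSketch:
    """Toy model of a constraint f = sign(b) over named variables."""

    def __init__(self, b, f):
        self.b, self.f = b, f

    def satisfied(self, assignment):
        return assignment[self.f] == sign(assignment[self.b])

    def possible_fixes(self, assignment):
        """Return (variable, new_value) pairs that would repair a violation."""
        if self.satisfied(assignment):
            return []
        # Option 1: change the output to match the input.
        fixes = [(self.f, sign(assignment[self.b]))]
        # Option 2: move the input to a value matching the output phase.
        if assignment[self.f] == 1.0:
            fixes.append((self.b, 0.0))
        elif assignment[self.f] == -1.0:
            fixes.append((self.b, -1.0))  # any negative value would do
        return fixes

    def tighten(self, lower, upper):
        """Deduce tighter bounds; returns new (lower, upper) dictionaries."""
        lower, upper = dict(lower), dict(upper)
        if lower[self.b] >= 0:      # b is non-negative, so f = 1
            lower[self.f] = upper[self.f] = 1.0
        elif upper[self.b] < 0:     # b is negative, so f = -1
            lower[self.f] = upper[self.f] = -1.0
        if lower[self.f] > -1.0:    # f cannot be -1, so f = 1 and b >= 0
            lower[self.f] = upper[self.f] = 1.0
            lower[self.b] = max(lower[self.b], 0.0)
        return lower, upper


c = SignConstraintSketch("b", "f")
print(c.possible_fixes({"b": -2.0, "f": 1.0}))  # [('f', -1.0), ('b', 0.0)]
```

The symmetric deduction (f cannot be 1, hence b < 0) is omitted because a strict upper bound cannot be expressed with the closed intervals used in this sketch.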


Input Interfaces for Sign Constraints (MarabouNetworkTF.py). 
Marabou supports various input interfaces, most notable of which is the TensorFlow
interface, which automatically translates a DNN stored in TensorFlow
protobuf or savedModel formats into a Marabou query. As part of our extensions,
we enhanced this interface so that it can properly handle BNNs and sign
constraints. Additionally, users can create queries using Marabou's native C++
interface, by instantiating the SignConstraint class discussed previously.


Network-Level Reasoner (NetworkLevelReasoner.cpp, Layer.cpp,
LPFormulator.cpp). The Network-Level Reasoner (NLR) is the part of Marabou
that is aware of the topology of the neural network being verified, as opposed to
just the individual constraints that comprise it. We extended Marabou's NLR
to support sign constraints and implement the optimizations discussed in
Section 4. Specifically, one extension that we added allows this class to identify
consecutive weighted sum layers and merge them. Another extension creates a
linear over-approximation of the network, including the trapezoid-shaped
over-approximation of each sign constraint. As part of the symbolic bound propagation
process, the NLR traverses the network, layer by layer, each time computing
the symbolic bound expressions for each neuron in the current layer.
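Merging two consecutive weighted sum layers rests on the identity $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$. The sketch below illustrates that identity only. It uses plain Python lists, the helper names are hypothetical, and it assumes there is no activation function between the two layers.

```python
def matvec(matrix, vector):
    """Matrix-vector product for row-major lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Matrix-matrix product for row-major lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
            for row in left]


def merge_weighted_sums(w1, b1, w2, b2):
    """Collapse y = w2 (w1 x + b1) + b2 into a single layer y = w x + b."""
    merged_w = matmul(w2, w1)
    merged_b = [wb + c for wb, c in zip(matvec(w2, b1), b2)]
    return merged_w, merged_b


w1, b1 = [[1, 2], [3, 4]], [1, -1]
w2, b2 = [[1, 1]], [0.5]
w, b = merge_weighted_sums(w1, b1, w2, b2)

x = [2, 3]
hidden = [h + c for h, c in zip(matvec(w1, x), b1)]
two_layers = [y + c for y, c in zip(matvec(w2, hidden), b2)]
one_layer = [y + c for y, c in zip(matvec(w, x), b)]
print(two_layers, one_layer)  # [26.5] [26.5]
```

The merged layer removes intermediate variables from the query without changing the function the network computes.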


Polarity-Based Splitting (DnCManager.cpp). We extended the methods
of this class, which is part of Marabou's S&C mechanism, to compute the polarity
value of each sign constraint (see Definition 1), based on the current bounds.


6 Evaluation 


All the benchmarks described in this section are included in our artifact, and
are publicly available online.


Strictly Binarized Networks. We began by training a strictly binarized network
over the MNIST digit recognition dataset (see footnote 4). This dataset includes 70,000
images of handwritten digits, each given as a 28 × 28 pixel image, with normalized
brightness values ranging from 0 to 1. The network that we trained has
an input layer of size 784, followed by six binary blocks (four blocks of size 50,
two blocks of size 10), and a final output layer with 10 neurons. Note that in the
first block we omitted the sign layer in order to improve the network's accuracy (see footnote 5).
The model was trained for 300 epochs using the Larq library and the Adam
optimizer [86], achieving 90% accuracy.

4 http://yann.lecun.com/exdb/mnist/


After training, we used Larq's export mechanism to save the trained
network in a TensorFlow format, and then used our newly added Marabou
interface to load it. For our verification queries, we first chose 500 samples from
the test set which were classified correctly by the network. Then, we used
these samples to formulate adversarial robustness queries [B38]: queries that ask
Marabou to find a slightly perturbed input which is misclassified by the network,
i.e., is assigned a different label than the original. We formulated 500 queries,
constructed from 50 queries for each of ten possible perturbation values
$\delta \in \{0.1, 0.15, 0.2, 0.3, 0.5, 1, 3, 5, 10, 15\}$ in $L_\infty$ norm, one query per input sample.
An UNSAT answer from Marabou indicates that no adversarial perturbation exists
(for the specified $\delta$), whereas a SAT answer includes, as the counterexample, an
actual perturbation that leads to misclassification. Such adversarial robustness
queries are the most widespread verification benchmarks in the literature
(e.g., [16, 25, 33, 52]). An example appears in Fig. 9:
the image on the left is the original, correctly classified as 1, and the image on
the right is the perturbed image discovered by Marabou, misclassified as 3.

Fig. 9: An adversarial example for the MNIST network.
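The input side of such a query amounts to one interval per pixel, and a SAT answer is an input inside those intervals whose winning output differs from the original label. The sketch below shows only this bookkeeping. The function names are hypothetical, and the optional clipping to a valid pixel range is an assumption; the text does not say whether the queries clip perturbed inputs.

```python
def linf_input_bounds(sample, delta, clip=None):
    """Per-input intervals [x - delta, x + delta] for an L-infinity ball.

    clip: optional (low, high) pair restricting each interval to a valid
    input range; whether to clip is an assumption of this sketch.
    """
    bounds = [(x - delta, x + delta) for x in sample]
    if clip is not None:
        low, high = clip
        bounds = [(max(low, lo), min(high, hi)) for lo, hi in bounds]
    return bounds


def is_adversarial(outputs, original_label):
    """True if the highest-scoring output differs from the original label."""
    predicted = max(range(len(outputs)), key=lambda i: outputs[i])
    return predicted != original_label


print(linf_input_bounds([0.5, 1.0], 0.25))               # [(0.25, 0.75), (0.75, 1.25)]
print(linf_input_bounds([0.5, 1.0], 0.25, clip=(0, 1)))  # [(0.25, 0.75), (0.75, 1.0)]
print(is_adversarial([0.1, 0.2, 0.9], original_label=1)) # True
```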

Through our experiments we set out to evaluate our tool’s performance, 
and also measured the contribution of each of the features that we introduced: 
(i) weighted sum (ws) layer elimination; (ii) LP relaxation; (iii) symbolic bound 
tightening (sbt); and (iv) polarity-based splitting. We thus defined five configu- 
rations of the tool: the all category, in which all four features are enabled, and 
four all-X configurations for X ∈ {ws, lp, sbt, polarity}, indicating that feature
X is turned off and the other features are enabled. All five configurations utilized
Marabou's parallelization features, except for all-polarity, where instead
of polarity-based splitting we used Marabou's default splitting strategy, which
splits the input domain in half in each step.

Fig. 10 depicts Marabou's results using each of the five configurations. Each
experiment was run on a machine with Intel Xeon E5-2637 v4 CPUs, running Ubuntu
16.04 and using eight cores, with a wall-clock timeout of 5,000 seconds. Most notably,
the results show the usefulness of polarity-based splitting when compared
to Marabou's default splitting strategy: whereas the all-polarity configuration
only solved 218 instances, the all configuration solved 458. They also show that
the weighted sum layer elimination feature significantly improves performance,
from 436 solved instances in all-ws to 458 solved instances in all, and with
significantly faster solving speed. With the remaining two features, namely LP
relaxations and symbolic bound tightening, the results are less clear: although
the all-lp and all-sbt configurations both slightly outperform the all configuration,
indicating that these two features slowed down the solver, we observe that
for many instances they do lead to an improvement; see Fig. 11. Specifically, on
UNSAT instances, the all configuration was able to solve one more benchmark
than either all-lp or all-sbt; and it strictly outperformed all-lp on 13% of the
instances, and all-sbt on 21% of the instances. Gaining better insights into the
causes for these differences is a work in progress.

5 This is standard practice; see https://docs.larq.dev/larq/guides/


[Plot for Fig. 10. x-axis: Accumulated Time (s), from 0 to 100,000; legend: all-sbt, all-lp, all, all-ws, all-polarity.]


Fig. 10: Running the five configurations of Marabou on the MNIST BNN. 


[Plots for Fig. 11. Left panel: all v. all-lp, x-axis: all-lp times (s). Right panel: all v. all-sbt, x-axis: all-sbt times (s). Both panels use logarithmic axes from 1 to 10000; legend: result (sat, unsat).]


Fig. 11: Evaluating the LP relaxation and symbolic bound tightening features.

XNOR-Net. XNOR-Net [45] is a BNN architecture for image
recognition networks. XNOR-Nets consist of a series of binary
convolution blocks, each containing a sign layer, a convolution
layer, and a max-pooling layer
(here, we regard convolution layers as a specific case of weighted sum layers).
We constructed such a network with two binary convolution blocks: the first
block has three layers, including a convolution layer with three filters, and the
second block has four layers, including a convolution layer with two filters. The
two binary convolution blocks are followed by a batch normalization layer and
a fully-connected weighted sum layer (10 neurons) for the network's output, as
depicted in Fig. 12. Our network was trained on the Fashion-MNIST dataset,
which includes 70,000 images from ten different clothing categories [55], each
given as a 28 × 28 pixel image. The model was trained for 30 epochs, and
achieved a modest accuracy of 70.97%.

Fig. 12: The XNOR-Net architecture of our network.

For our verification queries, we chose
300 correctly classified samples from
the test set, and used them to formulate
adversarial robustness queries.
Each query was formulated using
one sample and a perturbation value
$\delta \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}$ in $L_\infty$
norm. Fig. 13 depicts the adversarial
image that Marabou produced for one
of these queries. The image on the left is a correctly classified image of a shirt,
and the image on the right is the perturbed image, now misclassified as a coat.

Based on the results from the previous set of experiments, we used Marabou
with weighted sum layer elimination and polarity-based splitting turned on, but
with symbolic bound tightening and LP relaxation turned off. Each experiment
ran on an Intel Xeon E5-2637 v4 machine, using eight cores and a wall-clock
timeout of 7,200 seconds. The results are depicted in Table 1. The results demonstrate
that UNSAT queries tended to be solved significantly faster than SAT ones,
indicating that Marabou's search procedure for these cases needs further optimization.
Overall, Marabou was able to solve 203 out of 300 queries. To the best
of our knowledge, this is the first effort to formally verify an XNOR-Net. We
note that these results demonstrate the usefulness of an SMT-based approach
for BNN verification, as it allows the verification of DNNs with multiple types
of activation functions, such as a combination of sign and max-pooling.


Fig. 13: An original image (left) and its 
perturbed, misclassified image (right). 


7 Related Work 


DNNs have become pervasive in recent years, and the discovery of various faults
and errors has given rise to multiple approaches for verifying them. These include
various SMT-based approaches (e.g., [25, 33, 34, 38]), approaches based
on LP and MILP solvers, approaches based on symbolic
interval propagation or abstract interpretation (e.g., [16, 50, 52, 53]),
abstraction-refinement (e.g., [3, 15]), and many others. Most of these lines of work have
focused on non-quantized DNNs. Verification of quantized DNNs is PSPACE-hard [24],
and requires different tools than the ones used for their non-quantized
counterparts [18]. Our technique extends an existing line of SMT-based verifiers
to support also the sign activation functions needed for verifying BNNs; and
these new activations can be combined with various other layers.

Table 1: Marabou's performance on the XNOR-Net queries.

               SAT                         UNSAT
δ        # Solved   Avg. Time (s)    # Solved   Avg. Time (s)    # Timeouts
0.05     15           909.13         23           4.96           12
0.1      15         1,627.67         20          12.15           15
0.15      9         1,113.33         29           5              12
0.2      10         1,387.7          24           4.96           16
0.25      9         1,426            22           4.91           19
0.3       7         1,550.86         20          26.75           23
Total    65         1,317.52        138           9.16           97

Work to date on the verification of BNNs has relied exclusively on reducing
the problem to Boolean satisfiability, and has thus been limited to the strictly
binarized case [11, 29]. Our approach, in contrast, can be applied to binarized
neural networks that include activation functions beyond the sign function, as
we have demonstrated by verifying an XNOR-Net. Comparing the performance
of Marabou and the SAT-based approaches is left for future work.


8 Conclusion 


BNNs are a promising avenue for leveraging deep learning in devices with limited 
resources. However, it is highly desirable to verify their correctness prior to 
deployment. Here, we propose an SMT-based verification approach that enables 
the verification of BNNs. This approach, which we have implemented as part 
of the Marabou framework, seamlessly integrates with the other components
of the SMT solver in a modular way. Using Marabou, we have verified, for the
first time, a network that uses both binarized and non-binarized layers. In the
future, we plan to improve the scalability of our approach, by enhancing it with
stronger bound deduction capabilities, based on abstract interpretation [16].

Abstract. Modern SAT solvers can emit independently checkable proof
certificates to validate their results. The state-of-the-art proof system 
that allows for compact proof certificates is propagation redundancy (PR). 
However, the only existing method to validate proofs in this system with 
a formally verified tool requires a transformation to a weaker proof sys- 
tem, which can result in a significant blowup in the size of the proof and 
increased proof validation time. This paper describes the first approach 
to formally verify PR proofs on a succinct representation; we present (i) a 
new Linear PR (LPR) proof format, (ii) a tool to efficiently convert PR 
proofs into LPR format, and (iii) cake_lpr, a verified LPR proof checker 
developed in CakeML. The LPR format is backwards compatible with 
the existing LRAT format, but extends the latter with support for the 
addition of PR clauses. Moreover, cake_lpr is verified using CakeML’s 
binary code extraction toolchain, which yields correctness guarantees for 
its machine code (binary) implementation. This further distinguishes our 
clausal proof checker from existing ones because unverified extraction and 
compilation tools are removed from its trusted computing base. We ex- 
perimentally show that LPR provides efficiency gains over existing proof 
formats and that the strong correctness guarantees are obtained without 
significant sacrifice in the performance of the verified executable. 


Keywords: linear propagation redundancy - binary code extraction 


1 Introduction 


Given a formula of propositional logic, the task of a SAT solver is to decide if 
there exists an assignment that satisfies the formula. Such a satisfying assign- 
ment, if found by a SAT solver, is easily verifiable by independent checkers and 
so one does not need to trust the inner workings of the solver. The situation 
with unsatisfiable formulas, i.e., where no satisfying assignment exists, is not as 
straightforward. Here, SAT solvers must produce an unsatisfiability proof. Ide- 
ally, the proof system (and proof format) for such proofs should be sufficiently 
expressive, allowing SAT solvers to efficiently produce proofs that correspond to 
the SAT solving techniques they use at runtime. At the same time, the resulting 
proofs ought to be efficiently checkable by independent and trustworthy tools.

The de facto standard proof system for propositional unsatisfiability proofs is
known as Resolution Asymmetric Tautology (RAT) [24]. The associated DRAT 
format [36] combines clause addition based on RAT steps and clause deletion. 
Independent checking tools can validate proofs in the DRAT format; they have 
been used to check the results of the SAT competitions since 2014 [36] and 
in industry [15]. Enriching DRAT proofs with hints is the main technique for 
developing efficient verified proof checkers, e.g., existing verified checkers use the 
enriched proof formats LRAT [6] and GRAT [28]. 


A recently proposed proof system, called Propagation Redundancy (PR) [21], 
generalizes RAT. There exist short PR proofs without new variables for many 
problems that are hard for resolution, such as pigeonhole formulas, Tseitin prob- 
lems, and mutilated chessboard problems [19]. Due to the absence of new vari- 
ables it is easier to find PR proofs automatically [20], and it is considered unlikely 
that there exist short RAT proofs for these problems that do not introduce new 
variables nor reuse eliminated variables [21]. Such PR proofs can be checked di- 
rectly [21], or they can first be transformed into DRAT proofs or even Extended 
Resolution proofs by introducing new variables [18,25]. In theory, the blowup is 
small, i.e., polynomial-sized. However, in practice, the transformed proofs can be 
significantly more expensive to validate compared to the original PR proofs [21]. 


A natural question arises: why should proof checkers be trusted to correctly 
check proofs if we do not likewise trust SAT solvers to correctly determine satisfi- 
ability? One answer is that proof checkers are much easier to implement so their 
code can be carefully audited. Another answer is that the algorithms underlying 
proof checkers have been formally verified in a proof assistant [6, 15,28]. How- 
ever, to get executable code for these verified checkers, some additional unverified 
steps are still required. Although unlikely, each of these steps can introduce bugs 
in the resulting executable: (1) the algorithms are extracted by unverified code 
generation tools into source code for a programming language; (2) unverified 
parsing, file I/O, and command-line interface code is added; (3) the combined 
code is then compiled by unverified compilers down to executable machine code. 


The contributions of this paper are: (i) a new Linear PR (henceforth LPR) 
proof format that enriches PR proofs with hints and is backwards compatible 
with the LRAT format; (ii) a tool to efficiently enrich PR proofs with hints; and 
(iii) cake_lpr, an efficient verified LPR proof checker with correctness guarantees,
including for steps (1)-(3) enumerated above. The cake_lpr tool is publicly
available at https://github.com/tanyongkiam/cake_lpr and it was used to validate
the unsatisfiability proofs in the 2020 SAT Competition because of its
strong trust story combined with easy compilation and usage. Moreover, the 
stronger proof system could be supported in future competitions. 


Section 3 shows how PR proofs can be enriched to obtain LPR proofs and 
presents the corresponding LPR proof checking algorithm (Contributions i & ii). 
Notably, existing LRAT proof checkers can be extended in a clean and minimal 
way to support LPR proofs. Section 4 explains the implementation of our checker 
in CakeML, as well as the correctness guarantees and high-level verification strategy
behind the proofs (Contribution iii). Section 5 benchmarks our proof format
and proof checker against existing implementations. A summary comparison of
the new proof checker against existing verified proof checkers is in Table 1.

Table 1. A comparison of SAT proof checkers that have been verified in various proof
assistants [6,15,28]. Green background (cells with +) indicates desirable properties, e.g.,
LPR is based on a stronger proof system than LRAT and GRAT, while red backgrounds
(cells with ×) indicate less desirable properties. Yellow backgrounds (cells with −) are
also undesirable but to a lesser extent.

Property                       ACL2 checker [15]      Coq checker [6]          GRATchk [28]             cake_lpr
Proof System (Section 3)       − LRAT                 − LRAT                   − GRAT
Executable Code (Section 4)    − Directly Executed    Unverified Extraction    Unverified Extraction
Checking Speed


2 Background 


This section provides background on CakeML and its related tools. It also recalls 
the standard problem format and clausal proof systems used by SAT solvers. 


2.1 HOL4 and CakeML 


HOL4 is a proof assistant implementing classical higher-order logic [34]. CakeML 
is a programming language deeply embedded in HOL4, i.e., its abstract syntax 
is represented as a HOL datatype and its semantics is formalized within HOL4. 
Several tools for developing verified CakeML software are used in this work to 
fill the verification gaps in the correspondingly enumerated items in Section 1: 


(1) Two tools are used to produce (or extract) verified CakeML source code: 

— the CakeML proof-producing translator [32] automatically synthesizes 
verified source code from pure algorithmic specifications; 

— the CakeML characteristic formula (CF) framework [14] provides a sep- 
aration logic which can be used to manually verify (more efficient) im- 
perative code for performance-critical parts of the proof checker. 

(2) CakeML provides a foreign function interface (FFI) and a corresponding 
formal FFI model [10]. These are used to verify system call interactions, e.g.,
file I/O and command-line interfaces, under carefully specified assumptions. 

(3) Most importantly, CakeML has a compiler that is verified [35] to preserve 
the semantics of source CakeML programs down to their compiled machine 
code implementations. Hence, all guarantees obtained from the preceding 
steps can be carried down to the level of machine code.

The combination of these tools enables binary code extraction [27] where
verified machine code is extracted directly in HOL4. Several other CakeML-based 
programs have been verified using these tools, including: certificate checkers for 
floating-point error bounds [3] and vote counting [13], and an OpenTheory article 
checker [1]. Œuf provides a similar toolchain in the Coq proof assistant [31]. 


2.2 SAT Problems and Clausal Proofs 


Fix a set of boolean variables $x_1, \dots, x_n$, where the negation of variable $x_i$ is
denoted $\bar{x}_i$, and the negation of $\bar{x}_i$ is identified with $x_i$. Variables and their
negations are called literals and are denoted using $l$. The input for propositional
SAT solvers is a formula $F$ in conjunctive normal form (CNF) over the set
of variables $x_1, \dots, x_n$. Here, CNF means that $F$ consists of an outer logical
conjunction $F = \bigwedge_i C_i$, where each clause $C_i$ is a disjunction over some of
the literals $C_i = l_{i1} \lor l_{i2} \lor \dots \lor l_{ik}$. Formulas in CNF can be represented directly
as sets of clauses and clauses as sets of literals. The empty clause is denoted
$\bot$. An assignment $\alpha$ assigns boolean values to each variable; $\alpha$ can be partial,
i.e., it only assigns values to some of the variables. Like formulas and clauses,
a (partial) assignment can be represented as the set of literals assigned the
boolean value true by that assignment. The negation of an assignment, denoted
$\bar{\alpha}$, assigns the negation of all literals in $\alpha$. An assignment $\alpha$ satisfies a clause
$C$ iff their set intersection is nonempty. Additionally, we define $C|_\alpha = \top$ if
$\alpha$ satisfies $C$; otherwise, $C|_\alpha$ denotes the result of removing from $C$ all the
literals falsified by $\alpha$, i.e., $C|_\alpha = C \setminus \bar{\alpha}$. For a formula $F$, we define
$F|_\alpha = \{ C|_\alpha \mid C \in F \text{ and } C|_\alpha \neq \top \}$. Intuitively, $F|_\alpha$ contains the remaining clauses
in formula $F$ after committing to the partial assignment $\alpha$.

The task of a SAT solver is to determine whether $F$ is satisfiable, i.e., whether
there exists a (possibly partial) assignment $\alpha$ such that $F|_\alpha$ is empty. Any
satisfying assignment can be used as a certificate of satisfiability. Formulas without
a satisfying assignment are unsatisfiable. Certifying unsatisfiability is more difficult
and typically uses a clausal proof system [21]. The idea behind these proof
systems is briefly recalled next, using the key concept of clause redundancy.


Definition 1. A clause $C$ is redundant with respect to formula $F$ iff $F \land C$ and
$F$ are both satisfiable or both unsatisfiable, i.e., they are satisfiability equivalent.


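A clause $C$ that is redundant for $F$ can be added to $F$ without changing
its satisfiability. Clausal proof systems work by successively adding redundant
clauses to $F$ until the empty clause $\bot$ is added, as illustrated below:

$$F \;\overset{+\,\text{redundant } C_1}{\Longrightarrow}\; F \land C_1 \;\overset{+\,\text{redundant } C_2}{\Longrightarrow}\; F \land C_1 \land C_2 \;\overset{+\,\text{redundant } C_3}{\Longrightarrow}\; \cdots \;\Longrightarrow\; F \land C_1 \land C_2 \land \cdots \land \bot$$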

Satisfiability is preserved along each $\Longrightarrow$ step because of redundancy, e.g.,
satisfiability of $F$ implies satisfiability of $F \land C_1$. Since the final formula is
unsatisfiable, the sequence of redundant clause addition steps $C_1, C_2, \dots$
corresponds to a proof of unsatisfiability for $F$. Deciding clause redundancy is as hard
as solving the SAT problem itself because $\bot$ is always redundant for unsatisfiable
formulas. The difference between clausal proof systems is how the redundancy
of a (proposed) redundant clause $C$ is efficiently certified at each proof step.

Many notions of redundancy are based on unit propagation. A unit clause
is a clause with only one literal. The result of applying the unit clause rule to
a formula $F$ is the formula $F|_l$ where $(l)$ is a unit clause in $F$. The iterated
application of the unit clause rule to a formula $F$ until no unit clauses are left
is called unit propagation. If unit propagation on $F$ yields the empty clause $\bot$,
denoted by $F \vdash_1 \bot$, we say that $F$ implies $\bot$ by unit propagation. The notion of
implied by unit propagation is also used for regular clauses as follows: $F \vdash_1 C$ iff
$F \land \lnot C \vdash_1 \bot$ with $\lnot C = \bigwedge_{l \in C} (\bar{l})$. Observe that $\lnot C$ can be viewed as a partial
assignment that assigns the literals $\bar{l}$, for $l \in C$, to true. For a formula $G$, $F \vdash_1 G$
iff $F \vdash_1 C$ for all $C \in G$. The main clausal proof system used in this paper is
based on propagation redundant clauses, which are defined as follows.
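Unit propagation and the derived relation $F \vdash_1 C$ can be sketched in a few lines of Python. This is a naive illustration rather than how an efficient checker is implemented. Literals are nonzero integers with negation as arithmetic negation, and the function names are hypothetical.

```python
def unit_propagate(clauses):
    """Apply the unit clause rule until no unit clauses are left.

    Returns (conflict, assigned): conflict is True if the empty clause
    was derived; assigned lists the propagated literals in order.
    """
    clauses = [frozenset(c) for c in clauses]
    assigned = []
    while True:
        if frozenset() in clauses:
            return True, assigned
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            return False, assigned
        (lit,) = unit
        assigned.append(lit)
        # F|_lit: drop clauses satisfied by lit, remove the falsified literal -lit.
        clauses = [c - {-lit} for c in clauses if lit not in c]


def implied_by_unit_propagation(formula, clause):
    """F |-_1 C: adding the negation of C as unit clauses yields a conflict."""
    negated = [[-lit] for lit in clause]
    conflict, _ = unit_propagate(list(formula) + negated)
    return conflict


formula = [[1, 2], [-1, 2]]
print(implied_by_unit_propagation(formula, [2]))  # True
print(implied_by_unit_propagation(formula, [1]))  # False
print(unit_propagate([[1], [-1, 2], [-2, 3]]))    # (False, [1, 2, 3])
```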


Definition 2. Let $F$ be a formula, $C$ a nonempty clause, and $\alpha$ the smallest
assignment that falsifies $C$. Then, $C$ is propagation redundant (PR) with respect
to $F$ if there exists an assignment $\omega$ which satisfies $C$ and such that $F|_\alpha \vdash_1 F|_\omega$.


Intuitively, a PR clause $C$ is redundant because any satisfying assignment for
$F$ that does not already satisfy $C$ can be modified to a satisfying assignment
for $F \land C$ by updating its literals assigned to true according to the (partial)
witnessing assignment $\omega$ [21]. Propagation redundancy is checkable
in polynomial time using the witnessing assignment. PR generalizes various
other notions of clause redundancy, including the de facto standard Resolution
Asymmetric Tautology (RAT) proof system (see [21, Theorem 2]), which is able to
compactly express all current techniques used in state-of-the-art SAT solvers [24].

In practice, clausal proof formats also contain deletion information to speed
up proof validation. Hence, unsatisfiability proofs for formula $F$ are modeled
as sequences $I_1, \dots, I_n$ of instructions that either add or delete a clause. An
addition instruction is a triple $\langle a, C, \omega \rangle$, where $C$ is a clause and $\omega$ is a (possibly
empty) witnessing assignment; a deletion instruction is a pair $\langle d, C \rangle$ where $C$ is
a clause. The sequence $I_1, \dots, I_n$ gives rise to formulas $F_1, \dots, F_n$ with $F_0 = F$
as follows, where $F_j$ is the accumulated formula up to the $j$-th instruction:

$$F_j = \begin{cases} F_{j-1} \cup \{C\} & \text{if } I_j \text{ is of the form } \langle a, C, \omega \rangle \\ F_{j-1} \setminus \{C\} & \text{if } I_j \text{ is of the form } \langle d, C \rangle \end{cases}$$

A PR proof of unsatisfiability is valid if the last instruction adds the empty
clause $I_n = \langle a, \bot, \emptyset \rangle$, and, for all addition instructions $I_j = \langle a, C_j, \omega_j \rangle$, it holds
that $C_j$ is PR with respect to $F_{j-1}$ using witness $\omega_j$. In case an empty witness
is provided for $I_j$, then $F_{j-1} \vdash_1 C_j$ should hold.
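Definition 2 and the validity condition above can be turned into a naive reference checker. The sketch below is unverified and unoptimized, and it is not the cake_lpr algorithm: it ignores hints and re-runs unit propagation from scratch. All function names are hypothetical. Literals are nonzero integers, and instructions are tuples `("a", clause, witness)` or `("d", clause)`.

```python
def restrict(formula, assignment):
    """F|_assignment: drop satisfied clauses, remove falsified literals."""
    falsified = {-lit for lit in assignment}
    return [c - falsified for c in formula if not (c & assignment)]


def up_conflict(clauses):
    """True iff unit propagation on the clauses derives the empty clause."""
    clauses = list(clauses)
    while True:
        if frozenset() in clauses:
            return True
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            return False
        (lit,) = unit
        clauses = [c - {-lit} for c in clauses if lit not in c]


def implied(formula, clause):
    """F |-_1 C: add the negation of C as unit clauses and propagate."""
    return up_conflict(list(formula) + [frozenset({-lit}) for lit in clause])


def is_pr(formula, clause, witness):
    """Definition 2: witness satisfies C and F|_alpha |-_1 F|_witness."""
    if not (clause & witness):
        return False
    alpha = frozenset(-lit for lit in clause)  # smallest assignment falsifying C
    f_alpha = restrict(formula, alpha)
    return all(implied(f_alpha, d) for d in restrict(formula, witness))


def check_proof(formula, proof):
    """Validate a sequence of addition and deletion instructions."""
    current = {frozenset(c) for c in formula}
    for kind, clause, *rest in proof:
        clause = frozenset(clause)
        if kind == "d":
            current.discard(clause)
            continue
        witness = frozenset(rest[0]) if rest else frozenset()
        ok = is_pr(current, clause, witness) if witness else implied(current, clause)
        if not ok:
            return False
        current.add(clause)
    # The last instruction must add the empty clause.
    return bool(proof) and proof[-1][0] == "a" and len(proof[-1][1]) == 0


# Tiny example: F = (x1) and (not x1) is refuted by adding the empty clause.
print(check_proof([[1], [-1]], [("a", [], [])]))     # True
print(check_proof([[1, 2], [-1]], [("a", [], [])]))  # False
```

A hint-based format such as LPR replaces the repeated propagation in `implied` with a direct walk over the listed clause indices, which is what makes verified checking fast.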


3 Linear Propagation Redundancy 


This section describes a new clausal proof format called LPR (short for Linear
Propagation Redundancy). The format is designed to allow efficient validation
of PR clauses using a (verified) proof checker. We also enhanced the DPR-trim
tool (see footnote 3) to efficiently add hints to PR proofs, thereby turning them into LPR proofs.
Throughout the section, we emphasize how LPR can be viewed as a clean and
minimal extension of the existing LRAT proof format, which thereby enables its
straightforward implementation in existing LRAT tools.

⟨proof⟩   = { ⟨line⟩ }
⟨line⟩    = ( ⟨lpr⟩ | ⟨delete⟩ ), "\n"
⟨lpr⟩     = ⟨id⟩, ⟨clause⟩, ⟨witness⟩, "0", ⟨idlist⟩, { ⟨reduced⟩ }, "0"
⟨delete⟩  = ⟨id⟩, "d", ⟨idlist⟩, "0"
⟨reduced⟩ = ⟨neg⟩, ⟨idlist⟩
⟨idlist⟩  = { ⟨id⟩ }
⟨id⟩      = ⟨pos⟩
⟨lit⟩     = ⟨pos⟩ | ⟨neg⟩
⟨pos⟩     = "1" | "2" | ...
⟨neg⟩     = "-", ⟨pos⟩
⟨clause⟩  = { ⟨lit⟩ }
⟨witness⟩ = { ⟨lit⟩ }

Fig. 1. The grammar for the LPR format. Additions compared to the LRAT grammar [6]
are highlighted in bold.

The most commonly used proof format for SAT solvers is DRAT, which com- 
bines deletion with RAT redundancy [36]. DRAT proofs are easy for SAT solvers 
to emit and top-tier SAT solvers support it, but have some disadvantages for 
verified proof checking. In particular, checking whether a clause is RAT requires a 
significant amount of proof search to find the unit clauses necessary for showing 
the implied-by-unit-propagation property. This complicates verification of the 
proof checking algorithm and slows down the resulting verified proof checkers. 
The idea behind the Linear RAT (LRAT) [6,15] and GRAT [28] formats is to 
include these unit clauses as hints so that verified proof checkers can follow the 
hints directly without the need for proof search. The LPR format lifts this idea 
to allow fast validation of the PR property. 

An assignment $\omega$ reduces a clause $C$ if $C|_\omega \subsetneq C$ and $C|_\omega \neq \top$. To check the
PR property $F|_\alpha \vdash_1 F|_\omega$, it suffices to check, for each clause $C \in F$ reduced
by $\omega$, that $F|_\alpha \vdash_1 C|_\omega$. Hence, in practice, a smaller $\omega$ yields a cheaper PR
check. The LPR format extends the PR format by adding, for each clause that
is reduced by the witness, a list of all unit clause hints required for showing the
implied-by-unit-propagation property. Additionally, in order to point to clauses,
the LPR format includes an index for each clause at the beginning of each line.
The grammar of the LPR format is shown in Fig. 1.

Our extension to DPR-trim enriches input PR proofs by finding and adding
all required unit clause hints. It also shrinks the witness $\omega$ where possible: every
literal in $\omega \cap \alpha$ is removed as well as any literal in $\omega$ that is implied by unit
propagation from $F|_\alpha$. The shrinking was shown to be correct [21], but has
not been implemented so far. We observed that the witnesses in the PR proofs
produced by SaDiCaL [20] can be substantially compressed using this method.

3 LPR hint addition is now part of the public GitHub version available at
https://github.com/marijnheule/dpr-trim using the command-line option -L.

DIMACS file:

p cnf 12 22
1 2 3 0
4 5 6 0
7 8 9 0
10 11 12 0
-1 -4 0
-2 -5 0
-3 -6 0
-1 -7 0
-2 -8 0
-3 -9 0
...

LPR proof file:

23  -3 -10  -3 -10 1 12  0  -5 17  -8 20  -19 7  -22 10  0
24  -3 -11  -3 -11 2 12  0  -6 18  -9 21  -19 7  -22 10  0
25  -3  0  23 24 4 13  0
26  -6 -10  -6 -10 4 12  0  -5 11  -13 7  -14 20  -22 16  0
27  -6 -11  -6 -11 5 12  0  -6 12  -13 7  -15 21  -22 16  0
28  -6  0  26 27 4 19  0
29  -9 -10  -9 -10 7 12  0  -8 11  -13 10  -14 17  -19 16  0
30  -9 -11  -9 -11 8 12  0  -9 12  -13 10  -15 18  -19 16  0
31  -9  0  29 30 4 22  0
32  -2  0  6 9 28 31 2 3 14  0
33  -5  0  6 15 25 31 1 3 8  0
34  0  25 28 32 33 1 2 5  0

Fig. 2. (Left) The first ten clauses of pigeonhole formula (4 pigeons, 3 holes) in the
DIMACS format used by SAT solvers. (Right) The LPR refutation consisting of clause-
witness pairs and unit clause hints. The first bold integer in each line is the clause index
while other bold integers are the unit clause hints. Dropping the bold integers yields a
proof in the PR format. Redundant spaces have been added to improve readability.


Fig. 2 (left) shows an example formula in the standard DIMACS problem
format. The DIMACS format includes a header line starting with “p cnf ” fol- 
lowed by the number of variables and the number of clauses. The non-comment 
lines (not starting with “c ”) represent clauses, and they end with “0”. Positive 
integers denote positive literals, while negative integers denote negative literals. 
Fig. 2 (right) shows a corresponding proof in LPR format. Deletion lines in LPR 
are formatted identically to LRAT [6] (not shown here). For clause addition 
lines, the LPR format only differs from LRAT in case the clause to be added 
has PR but not RAT redundancy. A clause addition line in LPR format consists 
of three parts. The first part is the first integer on the line, which denotes the 
index of the new clause. The second part consists of the clause and the witness; 
the first group of literals is the clause. The (potentially empty) witness starts
at the second occurrence of the first literal of the clause and ends at the first 0, which
separates it from the unit clause hints. The second part exactly matches the PR proof
format [21]. The third part (after the first 0) consists of the unit clause hints, which
exactly match the LRAT format [6].
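
The three-part structure of a clause addition line can be made concrete with a small parser. This is a minimal Python sketch under stated assumptions: the function name `parse_lpr_addition` and the returned tuple layout are hypothetical, deletion lines are not handled, and negative integers in the hint part are taken to introduce the hints for one reduced clause.

```python
def parse_lpr_addition(line):
    """Split an LPR clause addition line into
    (index, clause, witness, unit_hints, groups).

    groups maps the index of a reduced clause to its list of hints.
    """
    tokens = [int(t) for t in line.split()]
    index, rest = tokens[0], tokens[1:]
    separator = rest.index(0)          # first 0 ends the clause/witness part
    body, hint_part = rest[:separator], rest[separator + 1:]

    # The witness starts at the second occurrence of the first clause literal.
    clause, witness = body, []
    if body and body[0] in body[1:]:
        split = body.index(body[0], 1)
        clause, witness = body[:split], body[split:]

    if not hint_part or hint_part[-1] != 0:
        raise ValueError("hint list must be terminated by 0")

    unit_hints, groups = [], {}
    current = unit_hints
    for hint in hint_part[:-1]:
        if hint < 0:
            current = groups.setdefault(-hint, [])  # start a new hint group
        else:
            current.append(hint)
    return index, clause, witness, unit_hints, groups


pr_line = "23 -3 -10 -3 -10 1 12 0 -5 17 -8 20 -19 7 -22 10 0"
rup_line = "25 -3 0 23 24 4 13 0"
print(parse_lpr_addition(pr_line))
print(parse_lpr_addition(rup_line))
```

Dropping the index and everything after the first 0 from the parsed data leaves exactly the clause and witness of a PR proof line.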


The checking algorithm for LPR, shown in Fig. 3, overlaps significantly with 
that for LRAT (see [6, Algorithm 1]). The only differences are Steps 4 and 
5.1. In Step 4, the witness is used (if present) instead of always using the first
literal in $C_j$. In Step 5.1, clauses are skipped if they are satisfied by the witness.
Notice that a clause can only be both reduced and satisfied by a witness if the
witness consists of at least two literals, while in the LRAT format witnesses
always consist of exactly one literal. Note also that the algorithm does not check
whether $C_j|_\omega = \top$, which is a requirement for PR. This omission is allowed
because the first literal in $\omega$ in the LPR (and PR) format is the same as the first
literal in $C_j$.
Input: CNF $F = \{C_i\}_{i \in I}$ and line $\ell$ an LPR step.
Output: YES if parsed clause $C_j$ proved PR for $F$ by $\ell$,
        NO otherwise.

1. parse $\ell$ as $j, C_j, \omega_j, 0, \vec{i}^0, \{-i^k, \vec{i}^k\}_k$,
   instantiating variables with (vectors of) positive integers.
2. set $\alpha \leftarrow \neg C_j$
3. for $i \in \vec{i}^0$
   3.1. set $C'_i \leftarrow C_i|_\alpha$
   3.2. if $C'_i = \bot$, return YES
   3.3. if $C'_i = \top$ or $|C'_i| \geq 2$, return NO
   3.4. set $\alpha \leftarrow \alpha \cup C'_i$
4. if $\omega_j \neq \emptyset$ then set $\omega \leftarrow \omega_j$ else set $\omega \leftarrow (C_j)_1$
   (if $C_j = \bot$, return NO)
5. for $i \in I$
   5.1. if $C_i$ is satisfied by $\omega$ or is not reduced by $\omega$,
        skip to next iteration of Step 5.
   5.2. find $k$ such that $i^k = i$ (from $\ell$)
        (return NO if no such $k$ exists)
   5.3. if $C_i|_{(\alpha \setminus \overline{\omega})} = \top$, skip
   5.4. set $\alpha' \leftarrow \alpha \cup (\neg C_i \setminus \omega)$
   5.5. for $m \in \vec{i}^k$
        5.5.1. set $C'_m \leftarrow C_m|_{\alpha'}$
        5.5.2. if $C'_m = \bot$, skip to next iteration of Step 5.
        5.5.3. if $C'_m = \top$ or $|C'_m| \geq 2$, return NO
        5.5.4. set $\alpha' \leftarrow \alpha' \cup C'_m$
   5.6. return NO
6. return YES


Fig. 3. Algorithm to check a single clause addition step in the LPR format. The bold 
parts show the additions compared to LRAT proof checking [6]. 
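
The steps of Fig. 3 can be sketched as an unverified Python function. This is only an illustration under stated assumptions: the names `follow_hints` and `check_lpr_step` are hypothetical, the formula is a dictionary from clause indices to literal lists, a hint clause counts as unit when exactly one literal remains, and adding the checked clause to the formula is left to the caller. The example formula is the pigeonhole problem with 4 pigeons and 3 holes, numbered as in the DIMACS convention used in this section.

```python
from itertools import combinations


def follow_hints(formula, assignment, hints):
    """Extend `assignment` in place by unit propagation along `hints`.
    Returns "CONFLICT", "INVALID" or "EXHAUSTED"."""
    for index in hints:
        clause = formula.get(index)
        if clause is None or any(lit in assignment for lit in clause):
            return "INVALID"                  # missing or satisfied hint
        remaining = [lit for lit in clause if -lit not in assignment]
        if not remaining:
            return "CONFLICT"                 # Steps 3.2 and 5.5.2
        if len(remaining) >= 2:
            return "INVALID"                  # Steps 3.3 and 5.5.3
        assignment.add(remaining[0])          # Steps 3.4 and 5.5.4
    return "EXHAUSTED"


def check_lpr_step(formula, clause, witness, unit_hints, groups):
    """Return True (YES) if the hints show `clause` to be PR, else False."""
    alpha = {-lit for lit in clause}                      # Step 2
    outcome = follow_hints(formula, alpha, unit_hints)    # Step 3
    if outcome == "CONFLICT":
        return True
    if outcome == "INVALID":
        return False
    if witness:                                           # Step 4
        omega = set(witness)
    elif clause:
        omega = {clause[0]}
    else:
        return False
    for index, other in formula.items():                  # Step 5
        satisfied = any(lit in omega for lit in other)
        reduced = any(-lit in omega for lit in other)
        if satisfied or not reduced:                      # Step 5.1
            continue
        if index not in groups:                           # Step 5.2
            return False
        if any(lit in alpha and -lit not in omega for lit in other):
            continue                                      # Step 5.3
        alpha_prime = alpha | {-lit for lit in other if -lit not in omega}
        if follow_hints(formula, alpha_prime, groups[index]) != "CONFLICT":
            return False                                  # Steps 5.5.3 and 5.6
    return True                                           # Step 6


# Pigeonhole formula: clauses 1-4 place each pigeon, clauses 5-22 forbid
# two pigeons in the same hole.
php = {}
next_index = 1
for pigeon in range(4):
    php[next_index] = [3 * pigeon + hole for hole in (1, 2, 3)]
    next_index += 1
for p, q in combinations(range(4), 2):
    for hole in (1, 2, 3):
        php[next_index] = [-(3 * p + hole), -(3 * q + hole)]
        next_index += 1

pr_ok = check_lpr_step(php, [-3, -10], [-3, -10, 1, 12], [],
                       {5: [17], 8: [20], 19: [7], 22: [10]})
php[23] = [-3, -10]
php[24] = [-3, -11]
rup_ok = check_lpr_step(php, [-3], [], [23, 24, 4, 13], {})
print(pr_ok, rup_ok)
```

The first call exercises the PR-specific Steps 4 and 5 with a four-literal witness; the second call succeeds already in Step 3, which is the common case described later for proofs that rarely need RAT or PR steps.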


4 CakeML Proof Checking 


This section explains the implementation and verification of cake_lpr, our veri- 
fied CakeML LPR proof checker. Section 4.1 focuses on the high-level verification 
strategy which we used to reduce the verification task to mostly routine low-level 
proofs (the latter details are omitted). Section 4.2 highlights important verified 
performance optimizations used in the proof checker. 


4.1 Verification Strategy 


The development of cake_lpr proceeds in three refinement steps, where each 
step progressively produces a more concrete and performant implementation of 
the proof checker. These refinements are visualized in the three columns of Fig. 4. 

Step 1 formalizes the definition of CNF formulas and their unsatisfiability, as 
well as the PR proof system described in Section 2.2. The inputs and outputs to 
          Step 1                                Step 2                               Step 3

Input     Abstract CNF Formula   (lift repr.)   Concrete CNF Formula   (parse)       DIMACS Input File
Checker   PR Proof System        (pure impl.)   LPR Checker (Fig. 3)   (imp. impl.)  cake_lpr
Output    Valid Proof (unsat.)   (verified)     YES or NO (Fig. 3)     (verified)    VERIFIED UNSAT or ERROR


Fig. 4. The three-step refinement used in the development of cake_lpr.


the proof system are abstract and not tied to any concrete representation at this 
step. For example, input variables are drawn from an arbitrary type $\alpha$, and clauses
and CNFs are represented using sets. The correctness of the PR proof system is
proved in this step, i.e., we show that a valid PR proof implies unsatisfiability of 
the input CNF. The proof essentially follows [21, Theorem 1]. 

Step 2 implements a purely functional version of the LPR proof checking al- 
gorithm from Fig. 3. Here, the inputs and outputs are given concrete representa- 
tions with computable datatypes, e.g., literals are integers (similar to DIMACS), 
clauses are lists of integers, and CNFs are lists of clauses. These concrete rep- 
resentations lift naturally to the abstract, set-based representation from Step 1. 
The output is a YES or NO answer according to the algorithm from Fig. 3. The
correctness theorem for Step 2 shows that LPR proof checking correctly refines
the PR proof system, i.e., if it outputs YES, then there exists a valid PR proof for
the input (lifted) CNF; by Step 1, this implies that the CNF is unsatisfiable.⁴

Step 3 uses imperative features available in the CakeML source language, e.g., 
(byte) arrays and exceptions, to improve code performance; these optimizations 
are detailed further in Section 4.2. This step also adds user interface features like 
parsing and file I/O so that the input CNF formula is read (and parsed) from 
a file, and the results are printed on the standard output and error streams. 
The verification of this step uses CakeML’s proof-producing translator [32] and 
characteristic formula framework [14] to prove the correctness of the source code 
implementation of cake_lpr; this code is subsequently compiled with the veri- 
fied CakeML compiler. Composing the correctness theorem for source cake_lpr 
with CakeML’s compiler correctness theorem yields the corresponding correct- 
ness theorem for the cake_lpr binary. The final correctness theorem is given 
in Appendix A. Briefly, it shows that if the cake_lpr executable prints the 
string “s VERIFIED UNSAT\n” to the standard output stream (in CakeML’s FFI 
model [10]), then the input (parsed) DIMACS file is an unsatisfiable CNF. 


4 If the output is NO, the input CNF could still be unsatisfiable, but the input LPR 
proof is not valid according to the algorithm in Fig. 3. 
4.2 Verified Optimizations


To minimize verification effort, CakeML’s imperative features are only used for 
the most performance-critical steps of cake_lpr. Our design decisions are based 
on empirical observations about the LPR proof checking algorithm. These are 
explained below with reference to specific steps in the algorithm from Fig. 3. 


Array-based representations. In practice, many LPR proof steps do not re- 
quire the full strength of a PR (or RAT) clause. Hence, a large part of proof 
checking time is spent in the Step 3 loop of the algorithm and it is important to
compute the main loop bottleneck, $C_i|_\alpha$ in Step 3.1, as efficiently as possible.
CakeML’s native byte arrays are used to maintain a compact bitset-like representation
of the assignment $\alpha$, so that $C_i|_\alpha$ can be computed in one pass over
$C_i$ with constant-time bitset lookup for each literal in $C_i$.
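
The idea of a byte-array assignment with constant-time literal lookup can be sketched in Python as follows. The slot layout (one byte per literal, positive and negative literals interleaved) and all names are assumptions for illustration; the layout used inside cake_lpr is not described here.

```python
class Assignment:
    """Assignment over variables 1..num_vars stored in a bytearray.
    Slot 2*v holds literal v and slot 2*v + 1 holds literal -v."""

    def __init__(self, num_vars):
        self.slots = bytearray(2 * (num_vars + 1))

    @staticmethod
    def _slot(lit):
        return 2 * lit if lit > 0 else -2 * lit + 1

    def set_true(self, lit):
        self.slots[self._slot(lit)] = 1

    def is_true(self, lit):
        return self.slots[self._slot(lit)] == 1

    def is_false(self, lit):
        return self.is_true(-lit)


def restrict(clause, assignment):
    """One pass over `clause`: return None if it is satisfied, otherwise
    the list of literals that are not falsified."""
    remaining = []
    for lit in clause:
        if assignment.is_true(lit):
            return None
        if not assignment.is_false(lit):
            remaining.append(lit)
    return remaining


alpha = Assignment(12)
alpha.set_true(3)
alpha.set_true(10)
print(restrict([-3, -11], alpha))
```

Each literal lookup is a single array access, so restricting a clause costs time linear in the length of the clause, independent of the size of the assignment.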

For proof steps requiring the full strength of PR clauses, Step 5 loops over 
all undeleted clauses in the formula. Formulas are represented as an array of 
clauses⁵ together with a lazily updated list that tracks all indices of the array
containing undeleted clauses. This enables both constant-time lookup of clauses 
throughout the algorithm and fast iteration over the undeleted clauses for Step 5. 
Deletion in the index list is done in (amortized) constant time by removing a 
deleted index only when the index is looked up in Step 5.1. Additionally, for 
each literal, the smallest clause index where that literal occurs (if any) is lazily 
tracked in a lookup array; for a given witness $\omega$, all clauses occurring at indices
below the index of any literal in $\omega$ can be skipped in Step 5.1.
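
A minimal Python sketch of the array-plus-lazy-index-list idea is given below. The class and method names are hypothetical, the sketch assumes that every clause index is added at most once, and the per-literal lookup array for the smallest clause index is not modelled.

```python
class ClauseStore:
    """Clauses in an array, plus a lazily cleaned list of live indices."""

    def __init__(self):
        self.clauses = []   # slot i holds clause i, or None once deleted
        self.live = []      # indices that may still hold a clause

    def add(self, index, clause):
        while len(self.clauses) <= index:
            self.clauses.append(None)
        self.clauses[index] = clause
        self.live.append(index)

    def lookup(self, index):
        """Constant-time access, used when following hints."""
        return self.clauses[index] if index < len(self.clauses) else None

    def delete(self, index):
        """Constant time: the index list is deliberately left untouched."""
        self.clauses[index] = None

    def live_clauses(self):
        """Return all undeleted (index, clause) pairs and drop stale
        indices from the list as they are encountered."""
        kept, result = [], []
        for index in self.live:
            clause = self.clauses[index]
            if clause is not None:
                kept.append(index)
                result.append((index, clause))
        self.live = kept
        return result


store = ClauseStore()
store.add(1, [1, 2, 3])
store.add(2, [-1, -4])
store.add(3, [-2, -5])
store.delete(2)
print(store.live_clauses())
```

Every stale index is removed from the list exactly once, the first time an iteration meets it, which is why deletion is constant time in the amortized sense.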


Proof checking exceptions. There are several steps in the proof checking 
algorithm that can fail (report NO) if the input proof is invalid, e.g., in Step 3.3. 
In a purely functional implementation, results are represented with an option: 
None indicating a failure and Some res indicating success with result res. While 
conceptually simple, this means that common case (successful) intermediate re- 
sults are always boxed within an option and then immediately unboxed with 
pattern matching to be used again. In cake_lpr, failures instead raise excep- 
tions which are directly handled at the top level. Thus, successful results can be 
passed directly, i.e., as res, without any boxing. Support for verifying the use of 
exceptions is a unique feature of CakeML’s CF framework [14]. 
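
The contrast between the two styles can be sketched in Python, with the caveat that Python does not box option values the way an ML implementation does; the sketch only shows the difference in control flow. All names are hypothetical.

```python
class CheckFailure(Exception):
    """Raised when a proof step is invalid (the checker answers NO)."""


def unit_literal_option(clause, assignment):
    """Option style: None signals failure, a literal signals success."""
    remaining = [lit for lit in clause if -lit not in assignment]
    if len(remaining) != 1 or remaining[0] in assignment:
        return None
    return remaining[0]


def propagate_option(clauses, assignment):
    for clause in clauses:
        lit = unit_literal_option(clause, assignment)
        if lit is None:               # every caller must inspect the result
            return None
        assignment = assignment | {lit}
    return assignment


def unit_literal_exn(clause, assignment):
    """Exception style: the successful result is returned directly."""
    remaining = [lit for lit in clause if -lit not in assignment]
    if len(remaining) != 1 or remaining[0] in assignment:
        raise CheckFailure("clause is not unit")
    return remaining[0]


def propagate_exn(clauses, assignment):
    for clause in clauses:
        assignment = assignment | {unit_literal_exn(clause, assignment)}
    return assignment


def check(clauses, assignment):
    """Failures are handled once, at the top level."""
    try:
        propagate_exn(clauses, assignment)
        return "YES"
    except CheckFailure:
        return "NO"


print(check([[1], [-1, 2]], frozenset()), check([[1, 2]], frozenset()))
```

In the exception style, the intermediate functions contain no failure-handling code at all; only `check` distinguishes the two outcomes.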


Buffered I/O streams. Proof files generated by SAT solvers can be large, e.g., 
ranging from 300 MB to 4 GB for the second benchmark suite in Section 5. These 
files are streamed into memory line by line because each proof step depends only 
on information contained in its corresponding line in the file. This streaming 
interaction is optimized using CakeML’s verified buffered I/O library [29] which 
maintains an internal buffer of yet-to-be-read bytes from the read-only proof file 
to batch and minimize the number of expensive filesystem I/O calls. 
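
The line-by-line streaming pattern can be sketched in Python using the standard buffered file reader. This is an analogy only: it does not use CakeML's verified I/O library, and the function name and buffer size are assumptions.

```python
import os
import tempfile


def stream_proof_lines(path, buffer_size=1 << 16):
    """Yield the whitespace-separated tokens of each non-empty line.
    Only one line is held in memory at a time; the buffered reader
    batches the underlying filesystem reads."""
    with open(path, "rb", buffering=buffer_size) as proof_file:
        for line in proof_file:
            tokens = line.split()
            if tokens:
                yield tokens


# Tiny demonstration with a temporary two-line proof file: one clause
# addition line followed by one deletion line.
with tempfile.NamedTemporaryFile("wb", suffix=".lpr", delete=False) as tmp:
    tmp.write(b"25 -3 0 23 24 4 13 0\n25 d 23 24 0\n")
steps = list(stream_proof_lines(tmp.name))
os.remove(tmp.name)
print(len(steps))
```

Because each proof step depends only on its own line, the consumer can check a step and discard its tokens before the next line is read.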


5 Deleted clauses are no longer referenced by the array and are automatically freed by 
CakeML’s garbage collector. 
5 Benchmarks


This section compares the verified CakeML LPR proof checker against other 
verified checkers on two benchmark suites⁶ and a RAT microbenchmark. The first
suite is a collection of problems with PR proofs generated by the satisfaction- 
driven clause learning (SDCL) solver SaDiCaL [20], while the second suite con- 
sists of unsatisfiable problems from the SAT Race 2019 competition. The RAT 
microbenchmark consists of proofs for large mutilated chessboards generated by 
a BDD-based SAT solver [5]. The CakeML checker is labeled cake_lpr (default
4GB heap and stack space), while other checkers used are labeled acl2-lrat 
(verified in ACL2 [15]), coq-lrat (verified in Coq [6]), and GRATchk (verified 
in Isabelle/HOL [28]) respectively. All experiments were run on identical nodes 
with Intel Xeon E5-2695 v3 CPUs (35M cache, 2.30GHz) and 128GB RAM. 
Configuration options specific to each benchmark suite are reported below. 


5.1 SaDiCaL PR Benchmarks 


The SaDiCaL solver produces PR proofs for hard SAT problems in its benchmark 
suite [20] and it is experimentally much faster than a plain DRAT-based CDCL 
solver on those problems [20, Section 7]. The PR proofs are directly checked 
by cake_lpr after conversion into LPR format with DPR-trim. For all other 
checkers, the PR proofs were first converted to DRAT format using pr2drat (as 
in the earlier approach [20]), and then into LRAT and GRAT formats using the 
DRAT-trim and GRATgen⁷ tools respectively. All tools were run with a timeout
of 10000 seconds and all timings are reported in seconds (to one decimal place). Results
are summarized in Tables 2 and 3. 

All benchmarks were successfully solved by SaDiCaL except mchess19 which 
exceeded the time limit. For the remaining benchmarks, generating and checking
LPR proofs required a comparable (1–2.5x) amount of time to solving the
problems, except mchess, for which LPR generation and checking is much faster 
than solving (Table 2). Unsurprisingly, direct checking of LPR proofs is much 
faster than the circuitous route of converting into DRAT and then into either 
LRAT or GRAT (Table 3). Unlike LPR, checking PR proofs via the LRAT route
is 5-60x slower than solving those problems; this is a significant drawback to 
using the route in practice for certifying solver results. 

The backwards compatibility of cake_lpr is also shown in Table 3, where 
it is used to check the generated LRAT proofs. Among the LRAT checkers, 
acl2-lrat is fastest, followed by cake_lpr (LRAT checking), and coq-lrat. Al- 
though cake_lpr (LRAT checking) is on average 1.3x slower than acl2-lrat, it 
scales better on the mchess problems and is actually much faster than acl2-lrat 
on mchess18. We also observed that the GRAT toolchain (summing SaDiCaL, 
pr2drat, GRATgen and GRATchk times) is much slower than the LRAT toolchains 


6 The suites are available at http://fmv.jku.at/sadical/ and http://sat-race-2019.ciirc.cvut.cz/ respectively.
7 GRATgen, the only tool that supports parallelism, was run with 8 threads.
Table 2. Timings (in seconds) for PR benchmarks with conversion into LPR format. The “Total
(LPR)” column sums the generation and checking times. The timing for mchess19 is
omitted because SaDiCaL timed out; timings for the Urquhart U.-s3-* benchmarks are
omitted because they took a negligible amount of time (< 1.0s total).

Problem    SaDiCaL  DPR-trim  cake_lpr (LPR)  Total (LPR)
hole20         1.0       0.5             0.7          2.2
hole30         6.9       2.4             6.1         15.4
hole40        31.3      10.0            25.1         66.3
hole50       101.7      35.5            87.9        225.1
mchess15      18.5       1.1             2.1         21.7
mchess16      21.7       1.2             2.1         25.0
mchess17      34.8       1.6             3.4         39.8
mchess18      59.8       2.3             5.2         67.2
U.-s4-b1       0.7       0.6             0.3          1.6
U.-s4-b2       0.3       0.4             0.2          0.8
U.-s4-b3       0.4       0.4             0.2          1.0
U.-s4-b4       0.3       0.5             0.3          1.1
U.-s5-b1       2.5       0.9             1.3          4.7
U.-s5-b2       1.2       0.6             0.7          2.4
U.-s5-b3       3.2       1.5             2.0          6.8
U.-s5-b4       5.5       1.5             3.2         10.1

Table 3. Timings (in seconds) for PR benchmarks, first converted to DRAT and subsequently
converted into LRAT and GRAT formats. The “Total (LRAT)” and “Total (GRAT)”
columns sum the fastest generation and checking times for the LRAT and GRAT
formats respectively. The “Total (LPR)” column (fastest total time) is reproduced
from Table 2 for ease of comparison. Fail(T) indicates a timeout. Timings for
the mchess19 and U.-s3-* benchmarks are omitted as in Table 2.

Prob.     pr2drat  DRAT-trim  cake_lpr  acl2-lrat  coq-lrat  GRATgen  GRATchk  Total   Total   Total
                              (LRAT)                                           (LPR)   (LRAT)  (GRAT)
hole20        0.8        4.4      18.5        7.9     966.7      4.6     18.2     2.2    14.2    24.6
hole30        6.8       61.4     180.4      105.9   Fail(T)     24.5    647.9    15.4   181.0   686.1
hole40       32.4      460.0    1039.5      711.8   Fail(T)    101.3  Fail(T)    66.3  1235.5       -
hole50      108.6     2663.0    4697.4     3292.2   Fail(T)    337.2  Fail(T)   225.1  6165.5       -
mchess15      7.7       48.2      49.3       36.2   Fail(T)     48.4   2023.1    21.7   110.6  2097.7
mchess16      9.0       62.0      59.8       53.2   Fail(T)     55.2   2903.8    25.0   145.9  2989.6
mchess17     14.5      105.0      97.3       88.5   Fail(T)     86.1   7050.9    39.8   242.7  7186.3
mchess18     25.1      195.0     152.7      296.8   Fail(T)    135.9  Fail(T)    67.2   432.5       -
U.-s4-b1      0.5        2.5       3.6        3.3     135.7      3.6     44.8     1.6     7.0    49.7
U.-s4-b2      0.2        0.8       1.4        1.0      23.2      1.7      8.2     0.8     2.3    10.4
U.-s4-b3      0.3        1.3       2.0        1.5      49.2      2.4     16.2     1.0     3.5    19.3
U.-s4-b4      0.3        1.1       1.8        1.4      38.3      2.0     10.3     1.1     3.1    12.9
U.-s5-b1      4.2       13.6      16.7       12.5    3048.7     17.4    933.2     4.7    32.8   957.3
U.-s5-b2      1.7        5.6       7.3        5.5     614.7      7.7    189.6     2.4    13.9   200.2
U.-s5-b3      5.0       18.4      26.3       22.2    8750.5     21.1   2316.3     6.8    48.8  2345.6
U.-s5-b4     11.3       34.2      36.9       30.1   Fail(T)     40.6  Fail(T)    10.1    81.0       -


(summing SaDiCaL, pr2drat, DRAT-trim and fastest LRAT checking times). 
This is in contrast to the SAT Race 2019 benchmarks below (Fig. 5), where we 
observed the opposite relationship. We believe that the difference in checking 
speed is due to the various checkers having different optimizations for checking 
the expensive RAT proof steps produced by conversion from PR proofs. 


5.2 SAT Race 2019 Benchmarks 


We further benchmarked the verified checkers on a suite of 117 unsatisfiable 
problems from the SAT Race 2019 competition. For all problems, DRAT proofs 
were generated using the state-of-the-art SAT solver CaDiCaL before conversion 
into the LRAT or GRAT formats. Notably, proofs generated by CaDiCaL on this 
Table 4. A summary of the SAT Race 2019 benchmark results. The N/A row counts
problems that timed out or failed in an earlier step of the respective toolchains.

Status   CaDiCaL  DRAT-trim  acl2-lrat  cake_lpr  coq-lrat  GRATgen  GRATchk
Success      102         97         96        97        36      100      100
Timeout       15          5          0         0        61        0        0
Failure        0          0          1         0         0        2        0
N/A            0         15         20        20        20       15       17

[Plot, top: curves for acl2-lrat, cake_lpr, and coq-lrat; horizontal axis: time limit (seconds, logarithmic scale).]

[Plot, bottom: curves for GRATgen + GRATchk, DRAT-trim + acl2-lrat, and DRAT-trim + cake_lpr; horizontal axis: time limit (seconds, logarithmic scale).]


Fig. 5. (Top) Total SAT Race 2019 proofs checked within a given (per instance) time 
limit for the LRAT proof checkers. (Bottom) Total SAT Race 2019 proofs generated and 
checked within a given (per instance) time limit for the LRAT and GRAT toolchains. 


suite rarely require RAT (or PR) steps, so the checkers are stress-tested on their 
implementation of file I/O, parsing, and Step 3.1 from Fig. 3; cake_lpr is the
only tool with a formally verified implementation of the first two steps. All
tools were run with the SAT competition standard timeout of 5000 seconds.

A summary of the results is given in Table 4. All proofs generated by CaDiCaL 
were checked by at least one checker. The acl2-lrat checker fails with a parse 
error on one problem even though none of the other checkers reported such an 
error; GRATgen aborted on two problems for an unknown reason. Plots com- 
paring LRAT proof checking time and overall proof generation and checking 
time (LRAT and GRAT) are shown in Fig. 5. From Fig. 5 (top), the relative
order of LRAT checking speeds remains the same, where cake_lpr is on av- 
erage 1.2x slower than acl2-lrat, although cake_lpr is faster on 28 bench- 
Table 5. Timings (in seconds) for the RAT microbenchmark. The number of proof steps and file size
of the proofs (in MB) are shown in the last two columns. Fail(T) indicates a timeout.

Problem     pgbdd  lrat-check  cake_lpr  acl2-lrat  coq-lrat  LRAT Steps  File Size (MB)
mchess20      3.9         0.5       0.5       19.6    3405.2      125752             5.1
mchess40     47.5         1.0       3.5      453.4   Fail(T)      769287              36
mchess60    311.7         2.7      10.6     4885.2   Fail(T)     2300522             114
mchess80   1164.1         4.8      22.6    Fail(T)   Fail(T)     5089457             259
mchess100  3599.0         9.3      44.2    Fail(T)   Fail(T)     9506092             499


marks. From Fig. 5 (bottom), both LRAT toolchains are slower than the GRAT 
toolchain (average 3.5 times slower for cake_lpr and 3.4 times for acl2-lrat). 
Part of the speedup for GRAT comes from GRATgen, which is the only tool that 
can be run in parallel (with 8 threads). This suggests that adding native support
for GRAT-based input to cake_lpr could be a worthwhile future extension. 


5.3 Mutilated Chessboard RAT Microbenchmarks 


The final microbenchmark suite tests the LRAT checkers on large mutilated 
chessboard problem instances (up to 100 by 100) solved by pgbdd, a BDD-based 
SAT solver [5]. Unlike the previous two suites, LRAT proofs are emitted directly 
by the solver so additional DRAT-trim conversion is not needed. All tools were run
with a timeout of 10000 seconds and all timings are reported in seconds (to one
decimal place). For additional scaling comparison, we also report results for lrat-check,
an unverified LRAT proof checker implemented in C.

The results in Table 5 show the impact of cake_lpr’s RAT optimizations 
(Section 4.2). Notably, cake_lpr scales essentially linearly in the size of the 
proofs (up to ~ 10 million proof steps). As a result, cake_lpr is significantly 
faster than acl2-lrat and coq-lrat on these RAT-heavy proofs and it comes 
within a 5x factor of the unverified lrat-check tool. 


6 Related Work 


Verified Proof Checking. There are several RAT-based verified proof checkers, 
in ACL2 [15], Coq [6], and Isabelle/HOL [28]. All three checkers are based on 
extensions of DRAT, which is itself an extension of the DRUP format [16]; the 
Coq checker is based on a predecessor for the GRIT [7] format. The ACL2 checker 
can be efficiently and directly executed (without extraction) using imperative 
primitives native to the ACL2 kernel [15]. However, the implementation of these 
features in ACL2 itself must be trusted to trust the proof checking results, hence 
the yellow background in Table 1. SMTCoq [2,9] is another certificate-based
checker for SAT and SMT problems in Coq. Its resolution-based proof certificates 
can be checked natively using native computation extensions of the Coq kernel. 


Applications. SAT solving is a key technology underlying many software and 
hardware verification domains [4, 23]. Certifying SAT results adds a layer of 
trust and is clearly a worthwhile endeavor. Solver-aided mathematical results [17,
22, 26] are particularly interesting and challenging to certify because these of- 
ten feature complicated SAT encodings, custom (hand-crafted) proof steps, and 
enormous resulting proofs [22]. Our cake_lpr checker can handle the latter two 
challenges effectively. For the first challenge, the SAT encoding of mathematical 
problems can also be verified within proof assistants. This was demonstrated for 
the Boolean Pythagorean Triples problem building on the Coq proof checker [8]. 


Verified SAT Solving. An alternative to proof checking is to verify the SAT 
solvers [11, 12, 30, 33]. This is a significant undertaking but it would allow the
pipeline of generating and checking proofs to be entirely bypassed. Furthermore, 
such verification efforts can yield new insights about key invariants underlying 
SAT solving techniques compared to prior pen-and-paper presentations, e.g., the 
2WL invariant [12]. However, the performance of verified SAT solvers is not
yet competitive with modern (unverified) SAT solving technology [11, 12].


7 Conclusion 


This work presents the new LPR proof format for verified checking of PR proofs. 
It demonstrates the feasibility of using binary code extraction to verify a perfor- 
mant LPR proof checker, cake_lpr, down to its machine code implementation. 

Given the strength of the PR proof system, there is ongoing research into the 
design of satisfaction-driven clause learning techniques [20,21] for SAT solvers 
based on PR clauses. Our proof checker opens up the possibility of using a verified 
checker to help check and debug the implementation of these new techniques. 
It also gives future SAT competitions the option of providing PR as the default 
(verified) proof system for participating solvers. 
A Correctness Theorem for cake_lpr


The correctness theorem for cake_lpr verified in HOL4 is shown in Fig. 6. The 
assumptions (1) (in red) are routine for compiled CakeML programs that use 
its basis library. The first line assumes that the command-line cl and file system 
fs models are well-formed. The second line assumes that the compiled code is 
correctly placed into (code) memory according to CakeML’s x64 machine model. 
⊢ wfcl cl ∧ wfFS fs ∧ std_streams fs ∧ hasFreeFD fs ⇒
  installed_x64 cake_lpr_code (basis_ffi cl fs) mc ms ⇒              (1)
    machine_sem mc (basis_ffi cl fs) ms ⊆
      extend_with_resource_limit                                     (2)
        { Terminate Success (cake_lpr_io_events cl fs) } ∧
    ∃ out err.
      extract_fs fs (cake_lpr_io_events cl fs) =
        Some (add_stdout (add_stderr fs err) out) ∧
      if out = «s VERIFIED UNSAT\n» then                             (3)
        (length cl = 3 ∨ length cl = 4) ∧ inFS_fname fs (el 1 cl) ∧
        ∃ mv fml.
          parse_dimacs (all_lines fs (el 1 cl)) = Some (mv,fml) ∧
          unsatisfiable (interp fml)
      else if length cl = 2 ∧ inFS_fname fs (el 1 cl) then
        case parse_dimacs (all_lines fs (el 1 cl)) of                (4)
          None ⇒ out = «»
        | Some (mv,fml) ⇒ out = concat (print_dimacs fml)
      else out = «»


Fig. 6. The end-to-end correctness theorem for the CakeML LPR proof checker. 


The first guarantee (2) (in blue) is that the machine code implementation 
always terminates normally according to CakeML’s x64 machine code semantics. 
In particular, the code never crashes and may emit some I/O events when run;
however, it possibly terminates with an out-of-memory error (extend_with_resource_limit)
when CakeML runs out of stack or heap space.

The main correctness guarantee for cake_lpr is (3) (in green) and (4) (in 
black). Briefly, (3) says that the only observable changes to the filesystem after
executing cake_lpr are strings printed on standard output out and standard
error err. According to (3), if the string “s VERIFIED UNSAT\n” is printed onto 
standard output, then cake_lpr was provided with a file (in its first command- 
line argument), and the file parses in DIMACS format to a formula fml which is 
unsatisfiable. The remaining else case (4), says that the only other possibilities 
for standard output are either (i) a printed version of the parsed DIMACS file (if 
no LPR proof file is provided), or (ii) the empty string. All other error messages 
are printed onto standard error. 

In addition, the DIMACS parser (parse_dimacs) is proved to be left inverse 
to the DIMACS printer (print_dimacs) in the following sense: 


⊢ wf_fml fml ⇒
    ∃ mv fml'.
      parse_dimacs (print_dimacs fml) = Some (mv,fml') ∧ interp fml = interp fml'


Briefly, this says that for any well-formed formula fml, printing that for- 
mula into DIMACS format then parsing it yields another formula fml’ which is 
guaranteed to have the same interpretation according to the semantics of CNFs 
formalized in HOL4. All parsed formulas are well-formed (not shown here). 
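
The left-inverse property can be illustrated with an unverified Python analogue. The functions below are simplified stand-ins that merely reuse the names `print_dimacs`, `parse_dimacs`, and `interp` from the theorem; their behavior (for example, which malformed inputs are rejected) is an assumption and is not the HOL4 definition.

```python
def print_dimacs(clauses):
    """Render a list of clauses (lists of nonzero ints) as DIMACS lines."""
    num_vars = max((abs(lit) for c in clauses for lit in c), default=0)
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(str(lit) for lit in clause + [0]) for clause in clauses]
    return lines


def parse_dimacs(lines):
    """Return (num_vars, clauses), or None if the input is malformed."""
    header, clauses = None, []
    try:
        for line in lines:
            tokens = line.split()
            if not tokens or tokens[0] == "c":
                continue                      # blank line or comment
            if tokens[0] == "p":
                if header is not None or len(tokens) != 4 or tokens[1] != "cnf":
                    return None
                header = (int(tokens[2]), int(tokens[3]))
                continue
            lits = [int(t) for t in tokens]
            if header is None or lits[-1] != 0 or 0 in lits[:-1]:
                return None
            if any(abs(lit) > header[0] for lit in lits):
                return None
            clauses.append(lits[:-1])
    except ValueError:
        return None
    if header is None or len(clauses) != header[1]:
        return None
    return header[0], clauses


def interp(clauses):
    """Set-of-sets interpretation: clause order and duplicates are ignored."""
    return frozenset(frozenset(clause) for clause in clauses)


fml = [[1, 2, 3], [-1, -4], []]
round_trip = parse_dimacs(print_dimacs(fml))
print(round_trip is not None and interp(round_trip[1]) == interp(fml))
```

The property compares interpretations rather than the formulas themselves, so a parser is free to normalize its output as long as the represented set of clauses is unchanged.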
Abstract. Deductive verification has been successful in verifying inter-
esting properties of real-world programs. One notable gap is the limited 
support for floating-point reasoning. This is unfortunate, as floating-point 
arithmetic is particularly unintuitive to reason about due to rounding 
as well as the presence of the special values infinity and ‘Not a Num- 
ber’ (NaN). In this paper, we present the first floating-point support in 
a deductive verification tool for the Java programming language. Our 
support in the KeY verifier handles arithmetic via floating-point decision 
procedures inside SMT solvers and transcendental functions via axioma- 
tization. We evaluate this integration on new benchmarks, and show that 
this approach is powerful enough to prove the absence of floating-point 
special values—often a prerequisite for further reasoning about numeri- 
cal computations—as well as certain functional properties for realistic 
benchmarks. 


Keywords: Deductive Verification · Floating-point Arithmetic · Transcendental Functions.


1 Introduction 


Deductive verification has been successful in providing functional verification for 
programs written in popular programming languages such as Java [4, 23, 41,49], 
Python [29], Rust [6], C [25,54], and Ada [19,50]. Deductive verifiers allow a 
user to annotate methods in a program with pre- and postconditions, from which 
they automatically generate verification conditions (VCs). These are then either 
proven directly by the verifier itself, or discharged with external tools such as 
automated (SMT) solvers or interactive proof assistants. 

While deductive verifiers fully implement many sophisticated data represen- 
tations (including heap data structures, objects, and ownership), support for 
floating-point numbers remains rather limited — solely Frama-C and SPARK offer 
automated support for floating-point arithmetic in C and Ada [32]. This state 
of affairs is at least partially a result of previous limitations in floating-point 
support in SMT solvers. Consequently, deductive verification has been used for 
floating-point programs only by experts with considerable manual effort [15,32].
This is unfortunate as it makes deductive verification unavailable for a large 
number of programs across many domains including embedded systems, machine 
learning, and scientific computing. With the increasing need for parallelization 
in code, scientific computing specifically has recently experienced algorithmic 
challenges for which formal methods may contribute to a solution [10,56]. 

One of the main challenges of floating-point arithmetic is its unintuitive 
behavior and the special values that the IEEE 754 standard [39] introduces. 
For instance, an overflow or a division by zero results in the special value 
(positive or negative) infinity, and not a runtime exception. Similarly, invalid 
operations like sqrt(-1.0) result in a Not a Number (NaN) value. These special 
values are problematic as seemingly straightforward identities do not hold (x
== x or x * 0.0 == 0.0). In addition, every operation on floating-point numbers
potentially involves rounding, which compromises familiar rules like associativity 
and distributivity. Hence, reasoning support for writing correct floating-point 
programs is indispensable. 
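
Some of these effects can be observed directly with IEEE 754 double-precision values. The snippet below uses Python rather than Java, which is an assumption of this illustration; note that Python raises exceptions for 1.0 / 0.0 and math.sqrt(-1.0) instead of returning special values, so those two cases are not shown.

```python
import math

nan = float("nan")
inf = float("inf")

# x == x fails for NaN.
nan_equals_itself = (nan == nan)

# x * 0.0 == 0.0 fails for infinity: the product is NaN.
inf_times_zero = inf * 0.0

# Overflow in a multiplication silently produces infinity.
overflow = 1e308 * 10.0

# Rounding breaks associativity of addition.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(nan_equals_itself, math.isnan(inf_times_zero), overflow == inf, left == right)
```

None of these operations signals an error, which is why a verifier has to rule out the special values explicitly before reasoning about the numerical result.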


Abstract interpretation-based tools can prove the absence of runtime errors 
and special values [20, 43], and bound roundoff errors due to floating-point’s 
finite precision [11, 21, 26,36,57]. SMT decision procedures [18] or SAT-based 
model-checking [24,56], on the other hand, can prove intricate properties requiring 
bit-precise reasoning. However, these techniques and tools largely support only 
purely floating-point programs or program snippets, or analyze programs only 
up to a predefined depth of the call stack. General reasoning about real-world 
object-oriented programs, however, also requires support for features such as the 
(unbounded) heap, necessitating different analyses which need to be combined 
with floating-point reasoning. 


Handling floating-points in a deductive verifier has unique advantages. First, 
the deductive verification approach already comes with the infrastructure for 
reasoning about complex control and data structures (like exception handling and 
heap). Second, it allows one to flexibly combine the verifier’s symbolic execution 
reasoning with external decision procedures. Third, depending on the theory 
support, the verifier or external solver may also generate counterexamples of a 
property and thus help program debugging — something an abstract interpretation- 
based approach fundamentally cannot provide. 


We report on adding floating-point support to the KeY deductive verifier, 
providing the first automated deductive floating-point support for the Java 
programming language. We focus mainly on proving the absence of the special 
values infinity and NaN. While these are helpful in certain circumstances, for most 
applications they signal an error. Hence, showing their absence is a prerequisite 
for further (functional) reasoning. That said, our extension also allows one to 
express and discharge arbitrary functional properties expressible in floating-point 
arithmetic, including bounds on roundoff errors for certain programs, and bounds 
on differences between two similar floating-point programs. 

We exploit both KeY’s symbolic execution and external SMT support. On 
the one hand, we handle arithmetic operations by relying on a combination of 
KeY’s symbolic execution to handle the heap and SMT based decision procedures 
to handle the floating-point part of the VCs. On the other hand, we support 
transcendental functions via axiomatization in the KeY prover itself. 

Transcendental functions such as sine are a common feature in numerical 
programs, but are not supported by floating-point decision procedures. We explore 
two ways of supporting them soundly but approximately, by encoding them as 
axiomatized uninterpreted function symbols once directly in the SMT queries, 
and once in additional calculus rules in KeY. Our evaluation shows that even 
though such reasoning is approximate, it is nonetheless sufficient to prove the 
absence of special values in many interesting programs. 

We evaluate KeY’s floating-point support on a number of real-world floating- 
point Java programs. Our benchmark set allows us to evaluate recent progress in 
SMT floating-point support in Z3 [28], CVC4 [8] and MathSAT [22] on yet unseen 
benchmarks. For instance, we observe that quantifiers are challenging even if they 
do not affect satisfiability of SMT queries. Our benchmarks are openly available, 
and we expect our insights to be useful for further solver development. 


Contributions In summary, we make the following contributions: 


— we implement and evaluate the first automated deductive verification of 
floating-point Java programs by combining the strength of rule based and 
SMT based deduction; 

— we collect a new set of challenging real-world floating-point benchmarks in 
Java (available at https://gitlab.mpi-sws.org/AVA/key-float-benchmarks/); 

— we compare different SMT solvers for discharging floating-point VCs on this 
new set of benchmarks; 

— and we develop novel automated support for reasoning about transcendental 
functions in a deductive verifier. 


2 Background 


2.1 Introduction to KeY 


KeY [4] is a platform for deductive verification of Java programs, working at a 
source code level. The input is a Java program annotated in the Java Modeling 
Language (JML) [45], encouraging a Design by Contract ([46,51]) approach to 
software development. The user specifies the expected behavior of Java classes 
with class invariants that the program has to maintain at critical points. Methods 
are specified with method contracts, consisting mainly of pre- and postconditions, 
with the understanding that if the precondition holds when the method is called, 
the postcondition has to hold after the method returns. 
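
As an illustration of this specification style, the following sketch shows a small class with a class invariant and a method contract. The class Account and its members are hypothetical and serve only as an example; they do not come from KeY or from our benchmarks. The JML annotations live in comments, so the class compiles as ordinary Java.

```java
// Hypothetical example of a JML-annotated class (Design by Contract).
final class Account {

    // spec_public makes the private field visible in public specifications.
    private /*@ spec_public @*/ int balance;

    //@ public invariant balance >= 0;

    /*@ public normal_behavior
      @   requires amount > 0 && balance <= Integer.MAX_VALUE - amount;
      @   ensures balance == \old(balance) + amount;
      @*/
    void deposit(int amount) {
        balance += amount;
    }

    /*@ pure @*/ int getBalance() {
        return balance;
    }
}
```

The precondition excludes integer overflow, so that a caller who respects it can rely on the postcondition and on the invariant being maintained.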

After loading an annotated program, KeY translates it to a formula in 
Java Dynamic Logic [4] (JavaDL), an instance of Dynamic Logic [37] which 
enables logical reasoning about Java programs. Logical rules are provided for 
the translation of programs into first-order logic, and for closing the resulting 
goals, or proof obligations. KeY is semi-interactive in that it allows manual rule 
application, while also offering powerful built-in automation and macros. In 
addition, it is also possible to translate an open goal into SMT-LIB format [9] 
and call an external SMT solver. For specific theories, SMT solvers can be much 
more efficient than KeY’s own automation. This makes it possible to prove some 
goals, which depend on SMT supported theories, by using an SMT solver, while 
others are proved internally, using KeY’s own automation. 


2.2 Floating-Point Arithmetic in Java 


In the following, we summarize some central characteristics of Java floating-point 
numbers, loosely following [53]. Each normal floating-point number $x$ can be 
represented as a triplet $(s, m, e)$, such that $x = (-1)^s \times m \times 2^e$, where 
$s \in \{0, 1\}$ is the sign, $m$ (called significand) is a binary fixed-point number with 
one digit before the radix point and $p - 1$ digits after the radix point (note that 
$0 \le m < 2$), and $e$ (exponent) is an integer such that $e_{min} \le e \le e_{max}$. Java 
supports two floating-point formats (both in base 2): float (‘single’) precision with 
$p = 24$ and minimal and maximal exponent $e_{min} = -126$, $e_{max} = 127$, and double 
precision with $p = 53$, $e_{min} = -1022$, $e_{max} = 1023$. 
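
The triplet representation can be made concrete with standard library calls. The following sketch extracts $(s, m, e)$ from a normal double; the class FloatTriple and its methods are hypothetical helper names, and the code is only meant for normal numbers (not zeros, subnormals, infinities or NaN).

```java
// Illustrative sketch (hypothetical names): decomposing a normal double
// into sign s, significand m and exponent e with x = (-1)^s * m * 2^e.
final class FloatTriple {

    /** Sign s: 0 for positive, 1 for negative values. */
    static int sign(double x) {
        return (int) (Double.doubleToLongBits(x) >>> 63);
    }

    /** Unbiased exponent e; for normal doubles it lies in [-1022, 1023]. */
    static int exponent(double x) {
        return Math.getExponent(x);
    }

    /** Significand m; for normal doubles 1 <= m < 2. Division by a power of two is exact here. */
    static double significand(double x) {
        return Math.abs(x) / Math.scalb(1.0, Math.getExponent(x));
    }

    public static void main(String[] args) {
        double x = -6.5; // -6.5 = (-1)^1 * 1.625 * 2^2
        System.out.println(sign(x) + " " + significand(x) + " " + exponent(x));
        // The exponent ranges quoted in the text are available as constants.
        System.out.println(Float.MIN_EXPONENT + " " + Float.MAX_EXPONENT);
        System.out.println(Double.MIN_EXPONENT + " " + Double.MAX_EXPONENT);
    }
}
```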

Whenever the result of a computation cannot be exactly represented with 
the given precision, it is rounded. IEEE 754 defines various rounding modes, of 
which Java only supports round to nearest, ties to even. Rounding is exact, as if 
one would first compute the ideal real number, and round afterwards. 

The triple representation gives us two zeros, $+0$ and $-0$, represented by 
$(0, 0, 0)$ and $(1, 0, 0)$, respectively. If the absolute value of the ideal result of a 
computation is too small to be representable as a floating-point number of the 
given format, the resulting floating-point number is $+0$ or $-0$. In addition, there 
are three special values, $+\infty$, $-\infty$, and NaN (Not a Number). If the absolute 
value of the ideal result of a computation is too big to be representable as a 
floating-point number of the given format, the result is $+\infty$ or $-\infty$. Also, division 
by zero will give an infinite result (e.g., $7.13 / {+0} = +\infty$). Computing further with 
infinity may give an infinite result (e.g., $+\infty + +\infty = +\infty$), but may also result 
in the additional ‘error value’ NaN (e.g., $+\infty - +\infty = \mathrm{NaN}$). Due to the presence 
of infinities and NaN, floating-point operations do not throw Java exceptions. 
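
The cases described above can be observed directly in Java. The sketch below is illustrative only; the class SpecialValues and the method describe are hypothetical names.

```java
// Illustrative sketch (hypothetical names): how signed zeros, infinities
// and NaN arise from ordinary double arithmetic.
final class SpecialValues {

    /** Classifies a double as NaN, a signed infinity, a signed zero, or finite. */
    static String describe(double x) {
        if (Double.isNaN(x)) return "NaN";
        if (Double.isInfinite(x)) return x > 0 ? "+Infinity" : "-Infinity";
        // Double.compare orders -0.0 before +0.0, unlike the == operator.
        if (x == 0.0) return Double.compare(x, 0.0) < 0 ? "-0" : "+0";
        return "finite";
    }

    public static void main(String[] args) {
        double inf = 7.13 / 0.0;                                  // division by zero
        System.out.println(describe(Double.MAX_VALUE * 2.0));     // result too big
        System.out.println(describe(Double.MIN_VALUE * 0.25));    // result too small
        System.out.println(describe(-Double.MIN_VALUE * 0.25));   // too small, negative
        System.out.println(describe(inf + inf));                  // stays infinite
        System.out.println(describe(inf - inf));                  // error value
    }
}
```

None of these operations throws an exception; the special value simply propagates into later computations.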

By default, the Java virtual machine is allowed to make use of higher-precision 
formats provided by the hardware. This can make computation more accurate, 
but it also leads to platform dependent behaviour. This can be avoided by using 
the strictfp modifier, ensuring that only the single and double precision types 
are used. This modifier ensures portability. 


3 Floating-Point Support in KeY 


3.1 Arithmetics 


In order to be able to specify and verify programs containing floating-point 
numbers, we made several extensions to the KeY tool. First, we added the float 
and double types to the KeY type system, together with an enum type for the 
different rounding modes of the IEEE 754 Standard. 

Listing 1.1: The Rectangle.scale benchmark 


/*@ public normal_behavior 
  @  requires \fp_nice(arg0.x) && \fp_nice(arg0.y) 
  @    && \fp_nice(arg1) && \fp_nice(arg2); 
  @  ensures !\fp_nan(\result.x) && !\fp_nan(\result.y) && 
  @    !\fp_nan(\result.width) && !\fp_nan(\result.height); 
  @ also 
  @ public normal_behavior 
  @  requires -5.53 <= arg0.x && arg0.x <= -3.38 && 
  @    -5.53 <= arg0.y && arg0.y <= -3.38 && 
  @    3.1 < arg0.width && arg0.width <= 3.7332 && 
  @    3.0000001 < arg0.height && arg0.height <= 4.0004 && 
  @    3.0003001 < arg1 && arg1 <= 4.0024 && 
  @    -6.4000003 < arg2 && arg2 <= 3.0001; 
  @  ensures !\fp_nan(\result.x) && !\fp_nan(\result.y) && 
  @    !\fp_nan(\result.width) && !\fp_nan(\result.height); 
  @*/ 

public Rectangle scale(Rectangle arg0, double arg1, double arg2) { 
  Area v1 = new Area(arg0); 
  AffineTransform v2 = AffineTransform.getScaleInstance(arg1, arg2); 
  Area v3 = v1.createTransformedArea(v2); 
  Rectangle v4 = v3.getRectangle2D(); 
  return v4; 
} 

We further introduced function and predicate symbols to formalize operations 
(+, *, ...) and comparisons (<, ==, ...) on floating-point expressions. The 
translation supports code both with and without the strictfp modifier. However, 
since the actual precision of non-strictfp operations is not known, the function 
symbols for such operations remain uninterpreted. We extended KeY’s parser to 
correctly handle programs and annotations containing floating-point numbers, and 
added logic rules for translating floating-point expressions from Java or JML to 
JavaDL. 

As an example, Listing 1.1 shows the JML specification of our Rectangle 
benchmark, which contains floating-point literals and makes use of the fp_nan and 
fp_nice predicates. fp_nan states that a floating-point expression is NaN, and 
fp_nice states that a floating-point expression is neither NaN nor infinity. The 
scale method has two contracts that are checked separately; each ensures, under 
different preconditions, that the fields of a scaled rectangle object are not NaN. 
For the first contract, the SMT solver produces a counterexample. In the second, 
we bound the inputs by concrete ranges that we picked arbitrarily, and the 
contract is proven valid. In practice, such ranges would come from the context, 
e.g. from the kind of rectangles that appear in an application, or from known 
ranges of sensor values. 

Concerning discharging the resulting proof obligations, there were two main 
ways to consider. One is to create a floating-point theory within KeY by adding 
axioms and deduction rules, so that the desired properties can be proven in 
KeY’s sequent calculus. The other way is to translate the proof obligations from 
JavaDL to SMT-LIB and call an external SMT solver. While the KeY approach 
traditionally favors conducting proofs within KeY, for this work, we partially 
deviated from this way in order to harness the greater experience and efficiency of 
SMT solvers when it comes to floating-point arithmetic. Our approach attempts 
to get the best of both worlds by distinguishing between basic floating-point 
arithmetic, i.e., elementary operations and comparisons, and more complex 
functions which do not have an SMT-LIB equivalent (e.g., the transcendental 
functions), or where the SMT-LIB function is not usefully implemented by current 
SMT solvers (see Section 3.2.B). 

Elementary operations and comparisons get translated to the corresponding 
SMT-LIB functions. In SMT-LIB, all floating-point computations conform to the 
IEEE 754 Standard. Therefore, only Java programs with the strictfp modifier 
can be directly translated to SMT-LIB without loss of correctness. 

We developed a translation from KeY’s floating-point theory to SMT-LIB. 
In order to integrate it into KeY, we also overhauled the existing translation 
from JavaDL to SMT-LIB to create a new, more modular framework. It supports 
all the features of the original translation, e.g., heaps and integer 
arithmetic, and additionally floating-point expressions. 

Floating-point intricacies sometimes require extra caution. For example, there 
are two different notions of equality for floats: bitwise equality and IEEE 754 
equality. Our implementation ensures these are distinguished correctly, and that 
the specification language remains intuitive for a developer to use. 
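
The difference between the two notions of equality is visible in plain Java. In the sketch below, the class FloatEquality and its methods are hypothetical names; we assume here that IEEE 754 equality corresponds to Java's == operator and that bitwise equality corresponds to comparing the encodings returned by Double.doubleToLongBits (which maps all NaN encodings to one canonical pattern). How KeY represents the two notions internally is not shown.

```java
// Illustrative sketch (hypothetical names): IEEE 754 equality versus
// equality of bit patterns for doubles.
final class FloatEquality {

    /** IEEE 754 equality, i.e. the semantics of Java's == on doubles. */
    static boolean ieeeEquals(double a, double b) {
        return a == b;
    }

    /** Equality of the 64-bit encodings (with NaN encodings canonicalized). */
    static boolean bitwiseEquals(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        // The two zeros are IEEE-equal but have different sign bits.
        System.out.println(ieeeEquals(0.0, -0.0) + " " + bitwiseEquals(0.0, -0.0));
        // NaN is never IEEE-equal to itself, but its encoding is.
        System.out.println(ieeeEquals(Double.NaN, Double.NaN) + " "
                + bitwiseEquals(Double.NaN, Double.NaN));
    }
}
```

The two notions disagree exactly on the signed zeros and on NaN, which is why a specification language has to be explicit about which one an == in a contract denotes.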

Using the translation to SMT-LIB, we can specify and prove two classes of 
properties in KeY: The absence of special values is specified using the fp_nan and 
fp_infinite predicates (or the fp_nice equivalent). Furthermore, one can specify 
functional properties that are expressible in floating-point arithmetic, e.g. one 
can compare the result of a computation against the result of a different program 
which is known to produce a good result or a reference value. 


3.2 Transcendental Functions 


Floating-point decision procedures in SMT solvers successfully handle programs 
consisting of arithmetic and square root operations. Many numerical real-world 
programs, however, include transcendental functions such as sin and cos. In Java 
programs, these functions are implemented as static library functions in the class 
java.lang.Math. 

Unlike arithmetic operations, transcendental functions are much more loosely 
specified by the IEEE 754 Standard—only an upper bound on the roundoff 
error is given. Libraries are thus free to provide different implementations, and 
even tighter error bounds. Exact reasoning in the same spirit as floating-point 
arithmetic would thus have to encode a specific implementation. Given that these 
implementations are highly optimized, this approach would be arguably complex. 
We observe, however, that such exact reasoning about transcendental functions is 
often not necessary and a sound approximate approach is sufficient and efficient. 

In this section, we introduce an axiomatic approach for reasoning about 
programs containing transcendental functions. We observe that with the flexibility 
of deductive verification and KeY itself, we can instantiate it in two different ways. 
We encode transcendental functions as uninterpreted functions and axiomatize 
them in the SMT queries. Alternatively, we encode these axioms in KeY as logical 
inference rules. 


(A) Axiomatization in SMT We encode library functions as uninterpreted 
functions and include a set of axioms in the SMT-LIB translation for each 
method that is called in a benchmark. That is, we extended KeY such that when 
a transcendental function exists in the proof obligation, its definition alongside 
all the axioms for that function are added to the translation. 

For the axiomatization of transcendentals, we did not add rules that expand 
to a definition or allow a repeated approximation of the function value (like 
expansion into a Taylor series). Instead, we added a number of lemmata encoding 
interesting properties related to special values. For instance, the following axiom 
states that if the input to the sin function is not a NaN or infinity, then the 
returned value of sin is between —1.0 and 1.0: 


(assert (forall ((a Float64)) (=> 
  (and (not (fp.isNaN a)) (not (fp.isInfinite a))) 
  (and (fp.leq (sinDouble a) (fp #b0 #b01111111111 #b0000...000000)) 
       (fp.geq (sinDouble a) (fp #b1 #b01111111111 #b0000...000000)))))) 


Note that this implies that the result is not a NaN or infinity. The other axioms 
are similar in spirit, so we do not list them. 
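
To relate the bit patterns in this axiom to familiar values, the following sketch rebuilds the two floating-point literals from their sign, exponent and fraction fields and checks the stated bound on concrete inputs. The class SinAxiomCheck and its methods are hypothetical names, and a check on sample inputs is of course not a proof; it only illustrates what the axiom asserts.

```java
// Illustrative sketch (hypothetical names): the literals in the sin axiom
// are the doubles 1.0 and -1.0, and the axiom bounds sin on nice inputs.
final class SinAxiomCheck {

    /** Assembles a double from SMT-LIB style fields: sign, 11-bit biased exponent, 52-bit fraction. */
    static double fromFields(int sign, int biasedExponent, long fraction) {
        long bits = ((long) sign << 63) | ((long) biasedExponent << 52) | fraction;
        return Double.longBitsToDouble(bits);
    }

    /** Mirrors the axiom: if a is neither NaN nor infinite, then -1.0 <= sin(a) <= 1.0. */
    static boolean axiomHoldsFor(double a) {
        if (Double.isNaN(a) || Double.isInfinite(a)) {
            return true; // premise is false, so the implication holds trivially
        }
        double upper = fromFields(0, 0b01111111111, 0L);
        double lower = fromFields(1, 0b01111111111, 0L);
        double s = Math.sin(a);
        return s <= upper && s >= lower;
    }

    public static void main(String[] args) {
        System.out.println(fromFields(0, 0b01111111111, 0L));
        System.out.println(fromFields(1, 0b01111111111, 0L));
        System.out.println(axiomHoldsFor(Math.PI / 2) && axiomHoldsFor(1.0e300));
    }
}
```

Since every comparison with NaN is false, a value that satisfies both bounds is necessarily neither NaN nor infinite.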

These axioms are expressed as quantified floating-point formulas and capture 
high-level properties of library functions complying with the specifications in the 
IEEE 754 Standard. Clearly, since we do not have the actual implementations of 
these functions, we are not able to prove arbitrary properties. However, such an 
axiomatization is often sufficient to check for the (absence of) special values, i.e. 
NaN and infinity, as our experiments in Section 4.4 show. 


(B) Taclets in KeY Reasoning about quantified formulas in SMT is a long- 
lasting challenge [34]. We have also observed in our experiments with only 
arithmetic operations (Section 4.3) that SMT solvers struggle with quantifiers in 
combination with floating-points. We have therefore implemented an alternative 
approach encoding the axioms not in the SMT queries, but instead as deductive 
inference rules (so-called taclets) in KeY. 

The rules encode the same logical information as the universally quantified 
assertions that we add in SMT-LIB (and where we leave the choice of 
instantiations entirely to the SMT/SAT solver). With our taclet approach, we 
instantiate a quantifier (only) to one’s needs. We note that for proving a property 
correct, this results in a correct (under)approximation. However, the price for 
achieving more closed proofs and shorter running times is that for disproving a 
property, not considering all possible quantifier instantiations may lead to spurious 
counterexamples, i.e., false positives. 

Columns (Benchmark Details): benchmark, # classes, # method calls, # arith. ops, library functions 
Columns (Automode Statistics): # goals closed by KeY, # goals to be closed externally, # rules applied, automode time (s) 

Complex.add (2) 0 - 3 / 3 1 / 4 185 / 286 0.7 / 0.2 
Complex.divide (2) 0 11 - 10 / 8 2 / 8 483 / 625 0.7 / 0.8 
Complex.compare 0 2 - 3 2 216 0.2 
Complex.reciprocal (2) 1 - 1 / 1 2 / 2 402 / 406 0.4 / 0.5 
Circuit.impedance 2 1 3 - 1 4 360 0.5 
Circuit.current (2) 2 3 14 - 117 11 4 / 1 1267 / 1238 4.0 / 4.1 
Matrix2.transposedEq 1 3 3 - 3 1 735 0.9 
Matrix3.transposedEq 4 34 - 3 1 1786 5.1 
Matrix3.transposedEqV2 4 34 - 3 1 1796 5.4 
Rectangle.scale (2) 3+ 23 220 - 32 / 32 32 / 16 5990 / 5617 18.4 / 14.5 
Rotate.computeError 1+ 6 26 - 108 8 3693 74.2 
Rotate.computeRelErr 1+ 6 28 - 120 8 3898 79.6 
FPLoop.fploop 0 1 - 2 4 99 0.1 
FPLoop.fploop2 0 1 - 2 4 99 0.1 
FPLoop.fploop3 0 1 - 2 4 99 0.1 
Cartesian.toPolar 2+ 3 6 sqrt, atan 1 4 438 0.5 
Cartesian.distanceTo 1+ 1 5 sqrt 2 1 191 0.1 
Polar.toCartesian 2+ 3 4 cos, sin 1 2 364 0.5 
Circuit.instantCurrent 2+ 14 23 sqrt, atan, cos 17 2 1686 14.1 
Circuit.instantVoltage 1+ 1 4 cos 0 2 138 0.1 

Table 1: Benchmark details and KeY automode statistics; time is measured in 
seconds 

A heuristic strategy applies the rules automatically using the occurrences 
of transcendentals as instantiation triggers. However, instantiating the axioms 
too eagerly considerably increases the number of open goals, which is why we 
assume that the user selects the axioms to apply manually (and did so in the 
experiments). After the application, the proof obligation can either be closed, i.e., 
proven, by KeY automatically, or be given to the SMT solver as before for final 
solving. 

Currently, the set of axioms (in the SMT-LIB translation and as taclets in 
KeY) only contains axioms for the transcendental functions occurring in our 
benchmarks. So far we have 10 axioms; however, adding more axioms (also for 
further transcendentals like exponentiation or logarithm) is straightforward. The 
full set of axioms is included in the Appendix of the technical report [3]. 


4 Evaluation 


4.1 Benchmark Programs 


We collected a set of existing floating-point Java programs representing real- 
world applications in order to evaluate the feasibility and performance of KeY’s 
floating-point support. 

The left half of Table 1 provides an overview of our benchmarks. Each 
benchmark consists of one method, which is composed of arithmetic operations 
and method calls to potentially other classes. The invocations of methods from 
java.lang.Math (e.g. Math.abs) are marked by “+1” in Table 1; these are resolved 
by inlining the method implementation. For benchmarks that contain calls to 
transcendental functions and square root, the called functions are listed; these are 
handled by our axiomatization. We include sqrt in this list, as we have observed 
that exact support can be expensive, so it may be advantageous to handle sqrt 
axiomatically. Benchmarks Rectangle, Circuit, Matrix3 and Rotation are partially 
shown in Listings 1.1, 1.2, 1.3 and 1.4 respectively. 

Listing 1.2: The Circuit.instantCurrent benchmark 


public class Circuit { 
  double maxVoltage, frequency, resistance, inductance; 
  // ... 

  /*@ public normal_behavior 
    @  requires 1.0 < this.maxVoltage && this.maxVoltage < 12.0 && 
    @    1.0 < this.frequency && this.frequency < 100.0 && 
    @    1.0 < this.resistance && this.resistance < 50.0 && 
    @    0.001 < this.inductance && this.inductance < 0.004 && 
    @    0.0 < time && time < 300.0; 
    @  ensures !\fp_nan(\result) && !\fp_infinite(\result); 
    @*/ 
  public double instantCurrent(double time) { 
    Complex current = computeCurrent(); 
    double maxCurrent = Math.sqrt(current.getRealPart() * current.getRealPart() + 
        current.getImaginaryPart() * current.getImaginaryPart()); 
    double theta = Math.atan(current.getImaginaryPart() / current.getRealPart()); 
    return maxCurrent * Math.cos((2.0 * Math.PI * frequency * time) + theta); 
  } 
} 

Each benchmark also includes a JML contract that is to be checked. For 
some methods, we specify two contracts (marked by “(2)” in the first column 
of Table 1), each serving as an independent benchmark. The contracts for most 
of these benchmarks check that the methods do not return a special value, i.e., 
infinity and/or NaN, the preconditions being that the variables are not themselves 
special values and possibly are bounded in a given range. For the Matrix, FPLoop 
and Rotate benchmarks, we check a functional property (see Section 4.3). FPLoop, 
which has three contracts, additionally shows how to specify floating-point loop 
behavior using loop invariants. 


4.2 Proof Obligation Generation 


To reason about the contract of a selected benchmark, we apply KeY, which 
generates proof obligations or ‘goals’. Some of these goals (heap-related) are 
closed by KeY automatically. The remaining open goals are closed by either SMT 
solvers with floating-point support directly (Section 3.1 and Section 3.2.A), or 
with a combination of transcendental KeY taclets and floating-point SMT solving 
(Section 3.2.B). 

Columns 6 and 7 in Table 1 show the number of proof obligations closed by 
KeY directly and to be discharged by external solvers, respectively. The next two 
columns show the number of taclet rules that KeY applied in order to close its 
goals, and the time this takes. For benchmarks with two contracts we show the 
respective values separated by ‘/’. 

We ran our experiments on a server with 1.5 TB memory and 4x12 CPU cores 
at 3 GHz. However, KeY runs single-threaded and does not use more than 8 GB 
of memory. 

For our set of benchmarks, the symbolic execution process is fully automated. 
Note that the machinery can deal with loop invariants, if they are provided. Loop 
invariant generation is, however, particularly challenging for floating-points due 
to roundoff errors [27,40], and a research topic in itself. 


4.3 Evaluation of SMT Floating-Point Support 


Previous work [32] reported that SMT support for floating-point arithmetic is 
rather limited. However, with recent advances [18], we evaluate the situation 
again. Most benchmarks used to evaluate SMT solvers’ decision procedures [1] 
aim to check (individual) specialized (corner case) properties of floating-point 
arithmetic. The proof obligations generated from our set of benchmarks are 
complementary in that they are more arithmetic heavy, while nonetheless relying 
on accurate reasoning about special values and functional properties. 

For each open goal not automatically closed, KeY generates one SMT-LIB 
file that is fed to the solvers for validation. We compare the performance of the 
three major SMT solvers with floating-point support: CVC4 [8] (version 1.8, with 
the SymFPU library [18] enabled), Z3 (4.8.9) [28] and MathSAT (5.6.3) [22]. For 
this, we set a timeout of 300 s for each proof obligation. While KeY is able to 
discharge proof obligations in parallel, for our experiments, we do so sequentially 
to maintain comparability. 

KeY’s default translation to SMT includes quantifiers. These quantifications 
are not related to floating-point arithmetic, but are used to logically encode 
important properties of the Java memory model, like the type hierarchy and 
the absence of dangling references on any valid Java heap. If we reason about 
floating-point problems in isolation, they are not needed, but if we want to 
consider Java verification more holistically with questions combining aspects of 
heap and floating-point reasoning, they become essential. We manually verified 
that the proof obligations without our axiomatized treatment of transcendental 
functions do not depend on these properties, and we investigated the quantifier 
support by including or removing them from the SMT translations. We do not report 
results with quantifiers for MathSAT, since it does not support them. 

Table 2 summarizes the results of our experiments. Column 4 shows the 
number of expected valid or invalid goals for all benchmarks. For each solver we 
show the number of goals that each solver can validate or invalidate, together 
with the average time (in seconds) needed. The goals resulting in timeout were 
excluded from the computation of the average time. Column 3 shows whether 
the SMT queries include quantifiers or not. 

                          quantified            CVC4             Z3               MathSAT 
     experiment           axioms      # goals   decided  avg.    decided  avg.    decided  avg. 
  1  valid contracts      yes         80        79       4.1     25       18.4    -        - 
  2  valid contracts      no          80        79       4.0     52       35.0    80       8.8 
  3  invalid contracts    yes         9         0        3.4     0        3.4     -        - 
  4  invalid contracts    no          9         8        36.7    7        27.6    9        3.9 
  5  axioms in SMT        yes         10        9        33.2    4        63.4    -        - 
  6  axioms as taclets    no          10        10       33.4    5        74.2    8        0.9 
  7  fp.sqrt              no          7         7        46.2    1        23.5    5        0.4 
  8  axiomatized sqrt     no          7         5        2.4     5        282.8   5        5.7 

Table 2: Summary of valid / invalid goals correctly decided and average running 
times of each solver for the SMT translations with and without quantified axioms 


Rows 1 and 2 of Table 2 show the results for benchmarks with valid contracts. 
This experiment thus represents the common use of KeY, whose main goal 
is to prove contracts correct. Rows 3 and 4 of Table 2 show the results 
for benchmarks with invalid contracts, i.e., those for which we expect a 
counterexample for at least one of the goals. The Appendix of the technical report [3] contains 
the detailed results for each experiment separated by benchmark. Figure 1 and 
Figure 2 show a more detailed view of the solvers’ running time for the valid 
benchmarks. The x-axis shows the number of open goals that are discharged by 
the SMT solvers, sorted by running time for each solver individually. The k-th 
point of one graph shows the minimum running time needed by the solver to 
close each of the k fastest goals. Note that each solver may have different goals 
which are its k fastest. The y-axis shows the time on a logarithmic scale. 
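
The construction of such a plot from raw measurements can be sketched as follows. The class CactusPlot, its methods and the sample numbers are hypothetical; the sketch assumes that a goal counts as solved when its running time is below the timeout, and it also shows the averaging over solved goals used in Table 2.

```java
import java.util.Arrays;

// Illustrative sketch (hypothetical names and data): per-solver plot
// points and average running time, ignoring goals that timed out.
final class CactusPlot {

    /** Entry k-1 of the result is the y-value of the k-th point: the k-th smallest solved time. */
    static double[] points(double[] runtimesSeconds, double timeoutSeconds) {
        return Arrays.stream(runtimesSeconds)
                .filter(t -> t < timeoutSeconds)
                .sorted()
                .toArray();
    }

    /** Average running time over solved goals only; NaN if no goal was solved. */
    static double averageSolved(double[] runtimesSeconds, double timeoutSeconds) {
        return Arrays.stream(runtimesSeconds)
                .filter(t -> t < timeoutSeconds)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] runtimes = {5.0, 0.5, 300.0, 2.0}; // made-up times, 300.0 marks a timeout
        System.out.println(Arrays.toString(points(runtimes, 300.0)));
        System.out.println(averageSolved(runtimes, 300.0));
    }
}
```

Because each solver's times are sorted independently, the same x-position may refer to different goals for different solvers.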

We conclude that in the presence of quantified axioms and floating-point 
arithmetic, solvers’ performance deteriorates for both valid and invalid goals. 
In particular, none of the solvers is able to find counterexamples for any of 
the invalid goals. However, when the quantified axioms are removed from the 
SMT translations, their performance improves. For valid contracts, CVC4 and 
MathSAT perform better than Z3, in terms of both number of goals validated 
and the running time per goal. In particular, MathSAT is able to prove all goals. 
However, the running time performance of CVC4 is better than MathSAT’s. For 
invalid contracts, solvers are able to produce the expected counterexamples at 
least partially. In particular, MathSAT performs better than CVC4 and 
Z3 in terms of both running time and the number of proof obligations for which 
it can produce counterexamples. 

We conducted another experiment on our Rectangle.scale benchmark to assess 
the solvers’ sensitivity to various changes, applied to the benchmark’s contract 
or its implementation. We considered modifications such as reducing the number 
of classes while keeping the same functionality, having tighter and larger bounds 
for variables, reducing the number of arithmetic operations etc. The details of 
this experiment can be found in the Appendix of the technical report [3]. In 
summary, solvers’ performance seems to be sensitive to slight, innocuous-looking 
changes such as the number of classes involved and variable bounds. For example, 
constraining arg2 in the original benchmark more tightly allows CVC4 to validate 
all goals (1 more). This behavior could potentially be exploited by, e.g., relaxing a 
variable’s bounds. 


Proving Functional Properties Listings 1.3 and 1.4 show examples of functional 
properties that are expressible in floating-point arithmetic and that KeY can 
handle. The verification results are included in rows 1 and 2 of Table 2; for more 
details, see the Appendix of the technical report [3]. 

For Matrix, we check that the determinants of a matrix and its transpose 
are equal. Note that this property holds trivially under real arithmetic, but 
not necessarily under floating-point arithmetic. After feeding transposedEq (which 
uses the determinant method) and its contract to KeY, increasing the default timeout 
sufficiently and discharging the created goal, CVC4 generates a counterexample 
in 170.2 s and MathSAT in 16.2 s. Z3 times out after 30 minutes. By 
feeding transposedEqv2 (which uses the determinantNew method) to KeY, CVC4 
validates the contract in 1.1 s, MathSAT in 3.9 s and Z3 times out again. One 
thing worth noting is that the way programs are written can greatly influence the 
computational effort needed to reject or verify the contract. This is evident 
from the fact that slightly modifying the order of operations (using determinantNew 
instead) substantially reduces verification time and changes the verification result 
for MathSAT and CVC4. 
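
That the grouping of operations can change a floating-point result, and not just the verification effort, is easy to see on a single product term. The following sketch is illustrative only: the class OperationOrder, its methods and the input values are hypothetical and are not taken from the Matrix3 benchmark or from the counterexamples found by the solvers.

```java
// Illustrative sketch (hypothetical names and values): the same product,
// grouped in two ways, gives two different doubles.
final class OperationOrder {

    /** Evaluated left to right, as in a term like a * e * i. */
    static double leftToRight(double a, double e, double i) {
        return a * e * i;
    }

    /** Explicitly grouped, as in a term like a * (e * i). */
    static double grouped(double a, double e, double i) {
        return a * (e * i);
    }

    public static void main(String[] args) {
        double a = 1.0e308, e = 10.0, i = 0.1;
        // a * e overflows to +Infinity before the multiplication by i.
        System.out.println(leftToRight(a, e, i));
        // e * i rounds to 1.0, so the product stays finite.
        System.out.println(grouped(a, e, i));
    }
}
```

With real numbers both groupings denote the same value, which is why such differences have to be reasoned about at the level of floating-point semantics.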

For Rotate, we check that the difference between an original vector and the 
one that is rotated four times by 90 degrees is not larger than 1.0E-15. 
We also verified the same bound for the relative difference (by exploiting another 
method and contract) for this benchmark. The constant cos90 in Listing 1.4 is 
not precisely 0.0 to account for rounding effects in the computation of the cosine. 
FPLoop includes three loops, for which the contracts check that the return value 
is bigger than a given constant. 

Though not always very fast, these examples show that verification of func- 
tional floating-point properties is viable. 

Listing 1.3: The Matrix3 benchmark 


public class Matrix3 { 
  double a, b, c, d, e, f, g, h, i; // The matrix: [[a b c],[d e f],[g h i]] 
  double det; 
  // method transpose not shown 

  double determinant() { 
    return (a * e * i + b * f * g + c * d * h) - 
           (c * e * g + b * d * i + a * f * h); 
  } 
  double determinantNew() { 
    return (a * (e * i) + (g * (b * f) + c * (d * h))) - 
           (e * (c * g) + (i * (b * d) + a * (f * h))); 
  } 
  /*@ ensures \fp_normal(\result) ==> (\result == det); @*/ 
  double transposedEq() { 
    det = determinant(); 
    return transpose().determinant(); 
  } 
  /*@ ensures \fp_normal(\result) ==> (\result == det); @*/ 
  double transposedEqv2() { 
    det = determinantNew(); 
    return transpose().determinantNew(); 
  } 
} 

Listing 1.4: The Rotation benchmark 


public class Rotation { 
  final static double cos90 = 6.123233995736766E-17; 
  final static double sin90 = 1.0; 

  // rotates a 2D vector by 90 degrees 
  public static double[] rotate(double[] vec) { 
    double x = vec[0] * cos90 - vec[1] * sin90; 
    double y = vec[0] * sin90 + vec[1] * cos90; 
    return new double[]{x, y}; 
  } 
  /*@ requires (\forall int i; 0 <= i && i < vec.length; 
    @   \fp_nice(vec[i]) && vec[i] > 1.0 && vec[i] < 2.0) && vec.length == 2; 
    @ ensures \result[0] < 1.0E-15 && \result[1] < 1.0E-15; 
    @*/ 
  public static double[] computeError(double[] vec) { 
    double[] temp = rotate(rotate(rotate(rotate(vec)))); 
    return new double[]{Math.abs(temp[0] - vec[0]), Math.abs(temp[1] - vec[1])}; 
  } 
} 


4.4 Evaluation of Support for Transcendental Functions in KeY 


We evaluated the two approaches from Section 3.2 (A and B) on our set of 
benchmarks; rows 5 and 6 in Table 2 summarize the results. (The detailed results 
of these experiments are included in the Appendix of the technical report [3].) 
Note that both approaches are fully automated. 

We conclude that the SMT solvers perform better when the axiomatization 
is applied at the KeY level. When axioms for transcendental functions are added 
to the SMT-LIB translation directly, Z3 validates 4 out of 10 goals. With the 
axiomatization at the KeY level, solvers are able to validate more goals (with 
quantified formulas removed from the SMT translations), e.g. Z3 is able to 
validate 5 goals and CVC4 can validate all. Therefore, it is preferable to apply 
the axioms on the KeY side via taclet rules. 

All the solvers we have used in this work comply with the IEEE 754 standard 
and therefore have bit-precise support for the square root function. They provide 
bit-precise reasoning by effectively encoding the behavior of floating-point circuits 
over bitvectors (which is naturally expensive), together with different heuristics 
and abstractions to speed up solving time. However, depending on the property, we 
do not always need bit-precise reasoning, so we propose handling the square root 
function with the same taclet-based axiomatization as introduced in Section 3.2.B. 

To this end, we conducted an experiment on the benchmarks containing sqrt, 
comparing the approach from Section 3.2.B (adding the necessary axioms, resp. 
taclet rules) to using the square root implemented in SMT solvers (fp.sqrt). We 
chose to include only axioms specified in or inferred from the IEEE 754 standard 
(e.g. if the argument of the square root function is NaN or less than zero, then 
the square root results in NaN). The full set of axioms that we used is included 
in the Appendix of the technical report [3]. 
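
To make the flavor of such axioms concrete, the following Python sketch states a few IEEE 754 facts about the square root as executable checks. It is only an illustration: the helper names ieee_sqrt and check_sqrt_axioms and the selection of properties are assumptions made here, not the taclet rules or axioms listed in the technical report [3].

```python
import math

def ieee_sqrt(x: float) -> float:
    """Square root with IEEE 754 special-case behavior (illustrative helper).

    Python's math.sqrt raises ValueError for negative inputs, so the NaN
    result prescribed by IEEE 754 is returned explicitly here.
    """
    if math.isnan(x) or x < 0.0:
        return math.nan
    return math.sqrt(x)

def check_sqrt_axioms(x: float) -> bool:
    """Check a few axiom-style properties of sqrt for a single input."""
    r = ieee_sqrt(x)
    if math.isnan(x) or x < 0.0:
        return math.isnan(r)              # NaN or negative argument yields NaN
    if math.isinf(x):
        return math.isinf(r) and r > 0.0  # sqrt(+infinity) is +infinity
    if x == 0.0:
        return r == 0.0                   # sqrt of a zero is a zero
    return r > 0.0                        # positive finite argument, positive result

# A hand-picked set of inputs covering the special values.
samples = [math.nan, -1.0, -0.0, 0.0, 1e-300, 2.0, 1e300, math.inf, -math.inf]
print(all(check_sqrt_axioms(x) for x in samples))
```

Such checks only test individual inputs; the axiomatized treatment in the prover reasons about all inputs at once, which is why the choice of axioms matters for the results reported below.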

Rows 7 and 8 in Table 2 summarize the results for this experiment; the detailed 
results are included in the Appendix of the technical report [3]. We observed 
that for two out of the three benchmarks, the average running time of all solvers 
decreases using the axiomatized square root. Furthermore, Z3 is able to reason 
about more proof obligations with the axiomatized version. However, the success 
of this approach depends on the axioms added to KeY and may not always work 
if we do not have suitable axioms. For example, for the Circuit.instantCurrent 
benchmark (Listing 1.2), using the axiomatized square root, CVC4 is not able to 
validate the contract, but with fp.sqrt the contract is validated. 

In summary, treating sqrt axiomatically can result in shorter solving times 
than performing bit-precise reasoning, but the approach may not always succeed 
when the axioms are not sufficient to prove a particular property. 


4.5 Discussion and insights 


The experiments show that highly automated floating-point program verification
is viable for relevant properties (handling of special values and some functional 
properties), up to a certain level of complexity (given by the SMT solvers). The 
choices of which parts of a proof obligation are delegated to SMT, and how they 
are translated to SMT, are crucial for achieving effective and efficient program
verification. Arithmetic operations proved to be more efficiently dealt with by
delegation to SMT, whereas for transcendental functions, axiomatization and
rule-based treatment in the theorem prover, outside the SMT solver, performs
clearly better.


5 Related Work 


Our implementation uses the floating-point SMT-LIB theory [17], which, however,
does not handle transcendental functions, as their semantics is (library)
implementation dependent. Some real-valued automated solvers do handle
transcendental functions [5,33], but to the best of our knowledge, the combination
of floating-point numbers and reals in SMT solvers is still severely limited.

None of the existing deductive verifiers support floating-point transcendental 
functions automatically. The Why3 deductive verification framework [30] has 
support for floating-point arithmetic, with front-ends for the C and Ada
programming languages through Frama-C [25] and SPARK [19,32], respectively.
Why3 has back-end support for different SMT solvers, as well as interactive proof
assistants like Coq. Until recently, Why3 would still discharge many interesting
floating-point problems with the help of Coq, relying on significant user
interaction. In later work [32] (in the context of floating-point verification for
Ada programs), Why3 can achieve a higher degree of automation. Note, however,
that the user is still required to add code assertions as well as ‘ghost code’ to a
significant extent.

The Boogie intermediate verification language [47] also supports floating- 
point expressions, and targets Z3 for discharging proof obligations. In the Boogie 
community, it was observed that writing a specification in Boogie leads to 
decreases in SMT solver performance when compared to writing the goal in 
SMT-LIB directly, probably due to an inherent mixing of theories when using 
Boogie [2]. This matches our own experiences, and separation of theories should 
be considered an important task for the further development of floating-point 
verification. 

Other deductive verifiers for Java have only rudimentary support for
floating-point numbers. VeriFast [41] treats floating-point operations as if they
were operations on real values, and OpenJML [23] parses programs with
floating-point operations, but essentially treats float and double as
uninterpreted sorts.

The Java category of verification competition SV-COMP [12] contains a num- 
ber of benchmarks that make use of floating-point variables. However, the focus 
of these benchmarks is usually not on arithmetical properties of expressions, but 
on the completeness of the Java language support. Amongst the participants of 
SV-COMP 2020, the Symbolic (Java) Pathfinder (SPF) [55] (and various exten- 
sions) and the Java Bounded Model Checker (JBMC) [24] support floating-point 
arithmetic. Besides being limited to exploring the state space up to a bounded 
depth, their constraint languages do not support quantifiers and abstracting of 
method calls—which are features that we have used in this work. 

Floating-point arithmetic has also been formalized in several interactive
theorem provers [16,31,42]. While one can prove intricate properties about
floating-point programs [14,15,38], proofs using interactive provers are to a large
part manual and require significant expertise. 

Abstract interpretation based techniques can show the absence of special 
values in floating-point code fully automatically, and several abstract domains 
which are sound with respect to floating-point arithmetic exist [20,43]. While the 
analysis itself is fully automated, applying it successfully to real-world programs 
in general requires adaptation to each program analyzed by end-users, e.g. the 
selection of suitable abstract domains or widening thresholds [13]. 

Besides showing the absence of special values, recent research has developed 
static analyses to bound floating-point roundoff errors [26,35,48,52,57]. These 
analyses currently work only for small arithmetic kernels and the tools in particular 
do not accept programs with objects. 

Dynamic analyses generally scale well on real-world programs, but can only 
identify bugs (when given failure-triggering input), rather than proving correctness 
for all possible inputs. Executing a floating-point program together with a higher- 
precision one allows one to find inputs which cause large roundoff errors [11,21,44]. 
Ariadne [7] uses a combination of symbolic execution, real-valued SMT solving 
and testing to find inputs that trigger floating-point exceptions, including overflow 
and invalid operations. Our work subsumes this approach as the SMT solvers 
that we use can directly generate counterexamples, but more importantly, KeY 
is able to prove the absence of such exceptions. 


6 Conclusion 


By joining the forces of rule-based deduction and SAT-based SMT solving, we 
presented the first working floating-point support in a deductive verification tool
for Java, thereby closing a remaining gap in KeY, which now supports full sequential
Java. Our evaluation shows that for specifications dealing with value ranges and
absence of NaN and infinity, our approach can verify realistic programs within a
reasonable time frame. We observe that the floating-point support of the MathSAT
and CVC4 solvers scales sufficiently for our benchmarks, as long as the queries do
not include any quantifiers, and that our axiomatized approach for handling 
transcendental functions is best realized using calculus rules in KeY’s internal 
reasoning engine. While our work is implemented within the KeY verifier, we 
expect our approach to be portable to other verifiers. 


Abstract. A smart contract is a program executed on a blockchain,
based on which many cryptocurrencies are implemented, and is being 
used for automating transactions. Due to the large amount of money that 
smart contracts deal with, there is a surging demand for a method that 
can statically and formally verify them. 

This tool paper describes our type-based static verification tool HELMHOLTZ
for Michelson, which is a statically typed stack-based language
for writing smart contracts that are executed on the blockchain platform
Tezos. HELMHOLTZ is designed on top of our extension of Michelson’s
type system with refinement types. HELMHOLTZ takes a Michelson program
annotated with a user-defined specification written in the form
of a refinement type as input; it then typechecks the program against
the specification based on the refinement type system, discharging the
generated verification conditions with the SMT solver Z3. We briefly
introduce our refinement type system for the core calculus Mini-Michelson
of Michelson, which incorporates the characteristic features such as
compound datatypes (e.g., lists and pairs), higher-order functions, and
invocation of another contract. HELMHOLTZ successfully verifies several
practical Michelson programs, including one that transfers money to an
account and one that checks a digital signature.


1 Introduction 


A blockchain is a data structure to implement a distributed ledger in a trustless yet
secure way. The idea of blockchains was initially devised for the Bitcoin
cryptocurrency [12] platform. Many cryptocurrencies are implemented using blockchains,
in which value equivalent to a significant amount of money is exchanged.
Recently, many cryptocurrency platforms allow programs to be executed on a
blockchain. Such programs are called smart contracts [19] (or simply contracts
in this paper) since they work as a device to enable automated execution of a
contract. In general, a smart contract is a program P_a associated with an account
a on a blockchain. When the account a receives money from another account b
with a parameter v, the computation defined in P_a is conducted, during which
the state of the account a (e.g., the balance of the account and values that are
stored by the previous invocations of P_a), which is recorded on the blockchain,
may be updated. The contract P_a may execute money transactions to another
account (say c), which results in invocations of other contracts (say P_c) during
or after the computation; therefore, contract invocations may be chained.

Although smart contracts’ original motivation was handling simple transac- 
tions (e.g., money transfer) among the accounts on a blockchain, recent contracts 
are being used for more complicated purposes (e.g., establishing a fund involving 
multiple accounts). Following this trend, the languages for writing smart con- 
tracts also evolve from those that allow a contract to execute relatively simple 
transactions (e.g., Script for Bitcoin) to those that allow a program that is 
as complex as one written in standard programming languages (e.g., EVM for 
Ethereum and Michelson [1] for Tezos [4]). 

Due to the large amount of money they deal with, verification of smart contracts
is imperative. Static verification is especially needed since a smart contract
cannot be fixed once deployed on a blockchain. Attacks on vulnerable contracts
have indeed happened. For example, the DAO attack, in which the vulnerability
of a fundraising contract was exploited, resulted in the loss of cryptocurrency
equivalent to approximately 150M USD [18].

In this paper, we describe our type-based static verifier HELMHOLTZ³ for
smart contracts written in Michelson. The Michelson language is a statically- and 
simply typed stack-based language equipped with rich data types (e.g., lists, maps, 
and higher-order functions) and primitives to manipulate them. Although several 
high-level languages that compile to Michelson are being developed, Michelson is 
most widely used to write a smart contract for Tezos as of writing. 

A Michelson program expresses the above computation in a purely functional 
style, in which the Michelson program corresponding to P_a is defined as a function.
The function takes a pair of the parameter v and a value s that represents the 
current state of the account (called storage) and returns a pair of a list of 
operations and the updated storage s’. Here, an operation is a Michelson value 
that expresses the computation (e.g., transferring money to an account and 
invoking the contract associated with the account) that is to be conducted after 
the current computation (i.e., P_a) terminates. After the computation specified
by P_a finishes with a pair of a storage value and an operation list, a blockchain
system invokes the computation specified in the operation list. This purely 
functional style admits static verification methods for Michelson programs similar 
to those for standard functional languages. 
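
The purely functional reading can be sketched in Python as a toy model. Everything named here is hypothetical and chosen only for illustration (Transfer, counter, forwarder, run_chain); the amount and the sender are passed as extra arguments for simplicity, and the first-in, first-out execution order of pending operations is a simplifying assumption rather than a statement about the Tezos protocol.

```python
from collections import deque
from typing import NamedTuple

class Transfer(NamedTuple):
    """Toy operation: send `amount` with argument `arg` to account `dest`."""
    arg: object
    amount: int
    dest: str

def counter(param, storage, amount, sender):
    """Hypothetical contract: add the parameter to the storage, issue nothing."""
    return [], storage + param

def forwarder(param, storage, amount, sender):
    """Hypothetical contract: forward the received money to account 'a'."""
    return [Transfer(param, amount, "a")], storage

def run_chain(contracts, storages, first, sender="b"):
    """Run an initial transfer and every operation it triggers."""
    pending = deque([(first, sender)])
    while pending:
        op, src = pending.popleft()
        # A contract is a pure function: (param, storage) -> (ops, storage').
        ops, new_storage = contracts[op.dest](op.arg, storages[op.dest], op.amount, src)
        storages[op.dest] = new_storage
        # Returned operations run only after the current call has finished.
        pending.extend((o, op.dest) for o in ops)
    return storages

contracts = {"f": forwarder, "a": counter}
print(run_chain(contracts, {"f": None, "a": 0}, Transfer(5, 10, "f")))
```

Because each contract body is an ordinary function from inputs to outputs, its behavior can be specified by a precondition on the input pair and a postcondition on the returned pair, which is the shape of specification used in the rest of the paper.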

As the theoretical foundation of HELMHOLTZ, we design a refinement type
system for Michelson as an extension of the original simple type system. In
contrast to standard refinement types that refine the types of values, our type
system refines the type of stacks. We briefly describe our type system in Section 3;
a detailed explanation is deferred to a future paper.


3 Hermann von Helmholtz (1821-1894), a German physicist and physician, was a
doctoral advisor of Albert A. Michelson (1852-1931), whom the Michelson language
is apparently named after.

We show that our tool can verify several practical smart contracts. In addition
to the contracts we wrote ourselves, we apply our tool to the sample Michelson
programs used in Mi-cho-coq [3], a formalization of Michelson in the Coq proof
assistant [21]. These include practical contracts such as one that
checks a digital signature and one that transfers money.

We note that HELMHOLTZ currently supports approximately 80% of the
instructions of the Michelson language. Another limitation of the current
HELMHOLTZ is that it can verify only a single contract, although one often uses 
multiple contracts for an application, in which a contract may call another by a 
money transfer operation, and their behavior as a whole is of interest. We are 
currently extending HELMHOLTZ so that it can deal with more programs. 

Our contributions are summarized as follows: (1) Definition of the core calculus
Mini-Michelson and its refinement type system; (2) Automated verification tool
HELMHOLTZ for Michelson contracts implemented based on the type system
of Mini-Michelson; the interface to the implementation can be found at
https://www.fos.kuis.kyoto-u.ac.jp/trylang/Helmholtz; and (3) Evaluation of
HELMHOLTZ with various Michelson contracts, including practical ones.

The rest of this paper is organized as follows. Before introducing the technical 
details, we present an overview of the verifier HELMHOLTZ in Section 2 using a 
simple example of a Michelson contract. Section 3 introduces the core calculus 
Mini-Michelson and its refinement type system. Section 4 describes the verifier 
HELMHOLTZ, a case study, and experimental results. After discussing related 
work in Section 5, we conclude in Section 6. 


2 Overview of Helmholtz and Michelson 


We overview our tool HELMHOLTZ in this section before presenting its technical 
details. We also explain Michelson by example (Section 2.2) and user-written 
annotation added to a Michelson program for verification purposes (Section 2.3). 


2.1 Helmholtz 


As input, HELMHOLTZ takes a Michelson program annotated with (1) its
specification expressed in a refinement type and (2) additional user annotations such
as loop invariants. It typechecks the annotated program against the specification
using our refinement type system; the verification conditions generated during
typechecking are discharged by the SMT solver Z3 [11]. If the code successfully
typechecks, then the program is guaranteed to satisfy the specification.

HELMHOLTZ is implemented as a subcommand of tezos-client, the client 
program of the Tezos blockchain. For example, to verify boomerang.tz in Figure 1, 
we run tezos-client refinement boomerang.tz. If the verification succeeds, 
the command outputs VERIFIED to the terminal screen (with a few log messages); 
otherwise, it outputs UNVERIFIED. 

 1 parameter unit;
 2 storage unit;
 3 << ContractAnnot { (param, st) | True } ->
 4      { (ops, st') | amount = 0 && ops = [] ||
 5          amount <> 0 && ops = [Transfer Unit amount (Contract source)]}
 6      & { _ | False } >>
 7 code                        /* (param,st) */
 8   { CDR;                    /* st */
 9     NIL operation;          /* [] ▷ st */
10     AMOUNT;                 /* amount ▷ [] ▷ st */
11     PUSH mutez 0;           /* 0 ▷ amount ▷ [] ▷ st */
12     IFCMPEQ
13       { /* [] ▷ st (amount = 0) */ }
14       { /* [] ▷ st (amount > 0) */
15         SOURCE;             /* src ▷ [] ▷ st */
16         CONTRACT unit;      /* Some (Contract src) ▷ [] ▷ st */
17         ASSERT_SOME;        /* (Contract src) ▷ [] ▷ st */
18         AMOUNT; UNIT;       /* Unit ▷ amount ▷ (Contract src) ▷ [] ▷ st */
19         TRANSFER_TOKENS;    /* Transfer Unit amount (Contract src) ▷ [] ▷ st */
20         CONS                /* [Transfer Unit amount (Contract src)] ▷ st */
21       };
22     /* ops ▷ st, where ops is the top element at the end of each branch, namely,
23        [Transfer Unit amount (Contract src)] if amount > 0; or [] otherwise */
24     PAIR                    /* (ops, st) */
25   }


Fig. 1. boomerang.tz. The comment inside /* */ describes the stack at the program 
point. 


2.2 An Example Contract in Michelson 


Figure 1 shows an example of a Michelson program called boomerang. A Michelson 
program is associated with an account on the Tezos blockchain; the program is 
invoked by transferring money to this account. This artificial program in Figure 1, 
when it is invoked, is supposed to transfer the received money back to the account 
that initiated the transaction. 

A Michelson program starts with type declarations of its parameter, whose 
value is given by contract invocation, and storage, which is the state that the 
contract account stores. Lines 1-2 declare that the types of both are unit, the 
type inhabited by the only value Unit. Lines 3-6 surrounded by << and >> are a
user-written annotation used by HELMHOLTZ for verification; we will explain this
annotation later. The code section in Lines 8-24 is the body of this program.

Let us take a look at the code section of the program. In the following 
explanation of each instruction, we describe the state of the stack after each 
instruction as comments; stack elements are delimited by ▷.


— Execution of a Michelson program starts with a stack with one value, which 
is a pair (param, st) of a parameter param and a storage value st.

— CDR pops the pair at the top of the stack and pushes the second value of the 
popped pair; therefore, after executing the instruction, the stack contains the 
single value st. 

— NIL pushes the empty list [] to the stack; the instruction is accompanied by 
the type operation of the list elements for typechecking purposes. 

— AMOUNT pushes the nonnegative amount of the money sent to the account to
which this program is associated.

— PUSH mutez 0 pushes the value 0. The type mutez represents a unit of money
used in Tezos.

— IFCMPEQ b1 b2, if the state of the stack before executing the instruction
is v1 ▷ v2 ▷ tl, (1) pops v1 and v2 and (2) executes the then-branch b1
(resp., the else-branch b2) if v2 = v1 (resp., v2 ≠ v1). In boomerang, this
instruction does nothing if amount = 0; otherwise, the instructions in the
else-branch are executed.

— SOURCE at the beginning of the else-branch pushes the address src of the
source account, which initiated the chain of contract invocations that the
current contract belongs to, resulting in the stack src ▷ [] ▷ st.

— CONTRACT T pops an address addr from the stack and typechecks whether the
contract associated with addr takes an argument of type T. If the typechecking
succeeds, then Some (Contract addr) is pushed; otherwise, None is pushed.
The constructor Contract creates an object that represents a typechecked
contract at the given address. In Tezos, the source account is always a contract
that takes the value Unit as a parameter; thus, Some (Contract src) will
always be pushed onto the stack.

— ASSERT_SOME pops a value v from the stack and pushes v’ if v is Some v’;
otherwise, it raises an exception.

— UNIT pushes the unit value Unit to the stack.

— TRANSFER_TOKENS, if the stack is of the shape varg ▷ vamt ▷ vcontr ▷
tl, pops varg, vamt, and vcontr from the stack and pushes (Transfer
varg vamt vcontr) onto tl. The value Transfer varg vamt vcontr is an
operation object expressing that money (of amount vamt) shall be sent to
the account vcontr with the argument varg after this program finishes
without raising an exception. Therefore, the program associated with vcontr
is invoked after this program finishes.

— CONS with the stack v1 ▷ v2 ▷ tl pops v1 and v2, and pushes a cons list
v1::v2 onto the stack. (We use the list notation in OCaml here.)

— After executing one of the branches associated with IFCMPEQ in this program,
the shape of the stack should be ops ▷ storage, where ops is [] if amount
= 0 or [Transfer varg vamt vcontr] if amount > 0. The instruction PAIR
pops ops and storage, and pushes (ops, storage).
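
The walkthrough above can be replayed with a small Python simulation that keeps the stack in a list whose first element is the stack top. This is a sketch under assumptions made for illustration: run_boomerang is a hypothetical name, values are encoded as plain Python strings and tuples, and CONTRACT unit is assumed to succeed, as the text states for the source account.

```python
def run_boomerang(amount, source, param="Unit", storage="Unit"):
    """Replay boomerang.tz step by step; the stack top is at index 0."""
    stack = [(param, storage)]                 # initial stack: (param, st)
    _, st = stack.pop(0)                       # CDR keeps the second component
    stack.insert(0, st)
    stack.insert(0, [])                        # NIL operation
    stack.insert(0, amount)                    # AMOUNT
    stack.insert(0, 0)                         # PUSH mutez 0
    v1, v2 = stack.pop(0), stack.pop(0)        # IFCMPEQ pops both operands
    if v2 != v1:                               # else-branch: amount is nonzero
        stack.insert(0, source)                # SOURCE
        addr = stack.pop(0)                    # CONTRACT unit (assumed to succeed)
        stack.insert(0, ("Some", ("Contract", addr)))
        tag, contract = stack.pop(0)           # ASSERT_SOME
        if tag != "Some":
            raise RuntimeError("ASSERT_SOME failed")
        stack.insert(0, contract)
        stack.insert(0, amount)                # AMOUNT
        stack.insert(0, "Unit")                # UNIT
        varg, vamt, vcontr = stack.pop(0), stack.pop(0), stack.pop(0)
        stack.insert(0, ("Transfer", varg, vamt, vcontr))  # TRANSFER_TOKENS
        head, tail = stack.pop(0), stack.pop(0)
        stack.insert(0, [head] + tail)         # CONS
    ops, st = stack.pop(0), stack.pop(0)       # PAIR
    stack.insert(0, (ops, st))
    if len(stack) != 1:
        raise RuntimeError("final stack must be a singleton")
    return stack[0]

print(run_boomerang(0, "tz1src"))
print(run_boomerang(5, "tz1src"))
```

Running the function with a zero amount exercises the empty then-branch, and a nonzero amount exercises the else-branch that builds the single Transfer operation.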


A Michelson program is supposed to finish its execution with a singleton stack 
whose unique element is a pair of (1) a list of operations to be executed after the 
current execution of the contract finishes and (2) the new value for the storage. 


Michelson is a statically typed language. Each instruction is associated with a
typing rule that specifies the shapes of stacks before and after it by a sequence of
simple types such as int and int list. For example, CONS requires the type of
the top element to be T and that of the second to be T list (for any T); it ensures
the top element after it has type T list.

Other notable features of Michelson include first-class functions, hashing,
instructions related to cryptography such as signature verification, and
manipulation of a blockchain using operations.
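
As a small illustration of such a typing rule, the following Python sketch applies the rule for CONS to a stack of simple types. The encoding is an assumption made for this example only: base types are strings, T list is the tuple ("list", T), and type_cons is a hypothetical helper, not part of Michelson or HELMHOLTZ.

```python
def type_cons(type_stack):
    """Apply the simple typing rule of CONS to a type stack (top at index 0)."""
    if len(type_stack) < 2:
        raise TypeError("CONS needs two stack elements")
    top, second, rest = type_stack[0], type_stack[1], type_stack[2:]
    # The top must have some type T and the second element the type T list.
    if second != ("list", top):
        raise TypeError(f"CONS expects T and T list, got {top} and {second}")
    # After CONS, the two elements are replaced by a single T list.
    return [("list", top)] + rest

# Type stack just before CONS in boomerang: operation, operation list, unit.
print(type_cons(["operation", ("list", "operation"), "unit"]))
```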


2.3 Specification


A user can specify the behavior of a program by a ContractAnnot annotation,
which is a part of the augmented syntax of our verification tool. A ContractAnnot
annotation gives a specification of a Michelson program by the following
notation inspired by refinement types: {(param,st) | pre} -> {(ops,st’)
| post} & {exc | abpost} where pre, post, and abpost are predicates. This
specification reads as follows: if this program is invoked with a parameter param
and storage st that satisfy the property pre, then (1) if the execution of this
program succeeds, then it returns a list of operations ops and new storage st’
that satisfy the property post; (2) if this program raises an exception with value
exc, then exc satisfies abpost. The specification language is expressive enough
to cover the specifications for practical contracts, including the ones we used in 
the experiments in Section 4.3. In the predicates, one can use several keywords 
such as amount for the amount of the money sent to this program when it is 
invoked and source for the source account’s address. 

The ContractAnnot annotation in Figure 1 (Lines 3-6) formalizes this pro- 
gram’s specification as follows. This program can take any parameter and storage 
(Line 3). Successful execution of this program results in a pair (ops,st’) that 
satisfies the condition in Lines 4-5 that expresses (1) if amount = 0, then ops is 
empty, that is, no operation will be issued; (2) if amount > 0, then ops is a list of 
a single element Transfer Unit amount (Contract source), which expresses 
transfer of money of the amount amount to the account at source with the unit 
argument.⁴ In the specification language, source and amount are keywords that
stand for the source account and the amount of money sent to this program, 
respectively. The part & { _ | False } expresses that this program does not 
raise an exception. This specification correctly formalizes the intended behavior 
of this program. 
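
The three parts of this specification can be read as ordinary predicates. The Python sketch below encodes them for boomerang and checks a single outcome against them; the function names and the tuple encoding of outcomes and operations are assumptions made for illustration. Note that such a check covers one run only, whereas HELMHOLTZ establishes the specification for all inputs by typechecking.

```python
def pre(param, st):
    """Precondition of boomerang: any parameter and storage are accepted."""
    return True

def post(ops, st_new, amount, source):
    """Postcondition: no operation for a zero amount, one refund otherwise."""
    refund = ("Transfer", "Unit", amount, ("Contract", source))
    return (amount == 0 and ops == []) or (amount != 0 and ops == [refund])

def abpost(exc):
    """Exceptional postcondition False: no exception may be raised."""
    return False

def satisfies_spec(outcome, param, st, amount, source):
    """Check one outcome, ('ok', ops, st') or ('exc', value), against the spec."""
    if not pre(param, st):
        return True          # the specification constrains only valid inputs
    if outcome[0] == "ok":
        _, ops, st_new = outcome
        return post(ops, st_new, amount, source)
    return abpost(outcome[1])

refund = ("Transfer", "Unit", 5, ("Contract", "tz1src"))
print(satisfies_spec(("ok", [refund], "Unit"), "Unit", "Unit", 5, "tz1src"))
```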


3 Refinement Type System for Mini-Michelson 


In this section, we formalize Mini-Michelson, a core subset of Michelson with its 
syntax, operational semantics, and refinement type system. We also state that 
the type system is sound. We omit many features from the full language in favor
of conciseness but include language constructs, such as higher-order functions
and iterations, that make verification difficult.

Figure 2 shows the syntax of Mini-Michelson. Values, ranged over by $V$,
consist of integers $i$; addresses $a$; operations transaction $(V, i, a)$ to invoke a
contract at $a$ by sending money of amount $i$ and an argument $V$; pairs $(V_1, V_2)$
of values; the empty list []; cons $V_1 :: V_2$; and code $\langle IS \rangle$ of first-class functions.⁵


4 As we mentioned in Section 1, HELMHOLTZ can currently verify the behavior of a 
single contract, although there will be an invocation of the contract associated with 
source after the termination of boomerang. An operation is treated as an opaque 
data structure, from which one cannot extract values. 

5 Closures are not needed because functions in Michelson can access only arguments. 
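
\[
\begin{array}{rcl}
V  & ::= & i \mid a \mid \texttt{transaction}\,(V, i, a) \mid (V_1, V_2) \mid [\,] \mid V_1 :: V_2 \mid \langle IS \rangle \\
T  & ::= & \texttt{int} \mid \texttt{address} \mid \texttt{operation} \mid T_1 \times T_2 \mid T\ \texttt{list} \mid T_1 \to T_2 \\
IS & ::= & \{ I_1; \dots; I_n \} \\
I  & ::= & IS \mid \texttt{DIP}\ IS \mid \texttt{DROP} \mid \texttt{DUP} \mid \texttt{SWAP} \mid \texttt{PUSH}\ T\ V \mid \texttt{NOT} \mid \texttt{ADD} \mid \texttt{IF}\ IS_1\ IS_2 \mid \\
   &     & \texttt{LOOP}\ IS \mid \texttt{PAIR} \mid \texttt{CAR} \mid \texttt{CDR} \mid \texttt{NIL}\ T \mid \texttt{CONS} \mid \texttt{IF\_CONS}\ IS_1\ IS_2 \mid \texttt{ITER}\ IS \mid \\
   &     & \texttt{LAMBDA}\ T_1\ T_2\ IS \mid \texttt{EXEC} \mid \texttt{TRANSFER\_TOKENS}\ T
\end{array}
\]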


Fig. 2. Syntax of Mini-Michelson 


Unlike Michelson, we use integers as a substitute for Boolean values so that 0 
means false and the others mean true. Simple types, ranged over by T, consist of 
base types (int, address, and operation, which are self-explanatory), pair types 
$T_1 \times T_2$, list types $T$ list, and function types $T_1 \to T_2$. Instruction sequences,
ranged over by IS, are a sequence of instructions, ranged over by I, enclosed by 
curly braces. A Mini-Michelson program is an instruction sequence. 

Instructions include those for stack manipulation (to DROP, DUPlicate, SWAP, 
and PUSH values); NOT and ADD for manipulating integers; PAIR, CAR, and CDR for 
pairs; NIL and CONS for constructing lists; and TRANSFER_TOKENS to create an 
operation that expresses a money transfer after the current contract execution. 
The instruction IF branches depending on whether the stack top is 0 or not; 
IF_CONS branches on whether the stack top is a cons or not. The instruction 
LOOP IS repeats IS as long as the stack top is a nonzero integer at the loop
entry; ITER IS is for iterating the list at the stack top. LAMBDA pushes a function
(described by its operand IS) onto the stack, and EXEC calls a function. Perhaps
unfamiliar is DIP IS, which pops and saves the stack top somewhere else, executes
IS, and then pushes the saved value back.

We also use a few kinds of stacks in the following definitions: value stacks,
ranged over by $S$, type stacks, ranged over by $\bar{T}$, and type binding stacks, ranged
over by $\Upsilon$, of the form $x_1 : T_1 \triangleright \dots \triangleright x_n : T_n$. The empty stack is denoted by $\ddagger$,
and push is by $\triangleright$. We often omit the empty stack and write, for example, $V_1 \triangleright V_2$
for $V_1 \triangleright V_2 \triangleright \ddagger$. Intuitively, $T_1 \triangleright \dots \triangleright T_n$ and $x_1 : T_1 \triangleright \dots \triangleright x_n : T_n$ describe stacks
$V_1 \triangleright \dots \triangleright V_n$ where each value $V_i$ is of type $T_i$. We will use variables to name stack
elements in the refinement type system.

Mini-Michelson (as well as Michelson) is equipped with a simple type system. 
The type judgment for instructions is written $\bar{T} \vdash I \Rightarrow \bar{T}'$, which means that
instruction $I$ transforms a stack of type $\bar{T}$ into another stack of type $\bar{T}'$. The
type judgment for values is written $V : T$, which means that $V$ is given simple
type $T$. We omit typing rules as they are fairly straightforward.


3.1 Operational Semantics 


We give a big-step operational semantics of Mini-Michelson by defining the
judgment $S \vdash I \Downarrow S'$, which means that executing the instruction $I$ under the
stack $S$ results in the stack $S'$ (and also $S \vdash IS \Downarrow S'$). Most rules for $S \vdash I \Downarrow S'$
are straightforward. We show rules for DIP and LOOP below and omit other rules.

\[
\dfrac{S \vdash IS \Downarrow S'}
      {V \triangleright S \vdash \texttt{DIP}\ IS \Downarrow V \triangleright S'}
\qquad
\dfrac{S \vdash IS \Downarrow S' \qquad S' \vdash \texttt{LOOP}\ IS \Downarrow S'' \qquad (i \neq 0)}
      {i \triangleright S \vdash \texttt{LOOP}\ IS \Downarrow S''}
\qquad
\dfrac{}{0 \triangleright S \vdash \texttt{LOOP}\ IS \Downarrow S}
\]

The first rule means that the body IS is executed with the stack $S$ obtained by
removing the top element $V$, which is pushed back onto the resulting stack $S'$.
There are two rules for LOOP: the first rule means that if the stack top is nonzero,
then the body is executed, and then the execution of LOOP IS is repeated; the
second rule means that, if the stack top is zero, then the loop acts as a no-op.
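
The rules for DIP and LOOP can be turned into a small evaluator. The Python sketch below covers only a handful of instructions and makes several simplifying assumptions for illustration: stacks are tuples with the top at index 0, instructions are tuples such as ("PUSH", 1) with the type operand of PUSH omitted, and the recursive LOOP rule is unfolded into a while loop. The names exec_seq and exec_instr are hypothetical.

```python
def exec_seq(instrs, stack):
    """Execute an instruction sequence IS on a stack (a tuple, top first)."""
    for instr in instrs:
        stack = exec_instr(instr, stack)
    return stack

def exec_instr(instr, stack):
    """Execute one instruction and return the resulting stack."""
    op, *args = instr
    if op == "DIP":
        # Run the body on the stack below the top, then push the top back.
        top, rest = stack[0], stack[1:]
        return (top,) + exec_seq(args[0], rest)
    if op == "LOOP":
        while True:
            top, stack = stack[0], stack[1:]   # each test pops the stack top
            if top == 0:
                return stack                   # zero: the loop ends
            stack = exec_seq(args[0], stack)   # nonzero: run the body, repeat
    if op == "PUSH":
        return (args[0],) + stack
    if op == "DROP":
        return stack[1:]
    if op == "DUP":
        return (stack[0],) + stack
    if op == "SWAP":
        return (stack[1], stack[0]) + stack[2:]
    if op == "ADD":
        return (stack[0] + stack[1],) + stack[2:]
    raise ValueError(f"unsupported instruction: {op}")

# DIP { ADD } adds the second and third elements and leaves the top untouched.
print(exec_seq([("DIP", [("ADD",)])], (1, 2, 3)))

# Count a number down to zero: DUP; LOOP { PUSH -1; ADD; DUP }.
countdown = [("DUP",), ("LOOP", [("PUSH", -1), ("ADD",), ("DUP",)])]
print(exec_seq(countdown, (3,)))
```

A loop body must leave an integer on top of the stack for the next test, which is why the countdown body ends with DUP.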


3.2 Refinement Type System 


In the refinement type system, a simple stack type $T_1 \triangleright \dots \triangleright T_n$ is augmented
with a formula $\varphi$ of first-order logic to describe the relationship among stack
elements. We introduce refinement stack types, ranged over by $\Phi$, of the form
$\{x_1 : T_1 \triangleright \dots \triangleright x_n : T_n \mid \varphi(x_1, \dots, x_n)\}$, which denotes stacks $V_1 \triangleright \dots \triangleright V_n$ such
that $V_1 : T_1, \dots, V_n : T_n$ and $\varphi(V_1, \dots, V_n)$ hold.

We show (part of) the syntax of terms and formulae of the first-order logic: 


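\[
\begin{array}{rcl}
t & ::= & x \mid V \mid \texttt{transaction}\,(t_1, t_2, t_3) \mid t_1 + t_2 \mid (t_1, t_2) \mid t_1 :: t_2 \mid \cdots \\
\varphi & ::= & t_1 = t_2 \mid \texttt{call}(t_1, t_2) = t_3 \mid \varphi_1 \lor \varphi_2 \mid \neg \varphi \mid \exists x : T.\, \varphi \mid \cdots
\end{array}
\]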


The language for predicates is multi-sorted, where a sort is a simple type of 
Michelson. The sorting rules for term constructors and relation symbols are 
standard. For example, in $t_1 + t_2$, both $t_1$ and $t_2$ have to be of sort int; and in
$t_1 = t_2$, the sorts of $t_1$ and $t_2$ must be the same, and so on. The only relation
symbol worth explaining is $\texttt{call}(t_1, t_2) = t_3$, which informally means that calling
function $t_1$ with argument $t_2$ (as the only element of the input stack) yields a
stack consisting only of $t_3$ as a result. We use other predicates, connectives, and
quantifiers such as $t_1 \neq t_2$, $\varphi_1 \land \varphi_2$, $\varphi_1 \implies \varphi_2$, and $\forall x : T.\, \varphi$, which can be
considered as derived forms.

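We define the semantics of the formulae in a standard manner. Let $\sigma$ be a value
assignment, i.e., a sort-respecting finite map from variables to values. We define
the interpretation $[\![t]\!]_\sigma$ of $t$ under $\sigma$ and valid formulae under a value assignment,
denoted by $\sigma \models \varphi$; for $\texttt{call}(t_1, t_2) = t_3$, we define $\sigma \models \texttt{call}(t_1, t_2) = t_3$ iff
$[\![t_2]\!]_\sigma \triangleright \ddagger \vdash [\![t_1]\!]_\sigma \Downarrow [\![t_3]\!]_\sigma \triangleright \ddagger$. Equality on instruction sequences is intensional:
formula $\langle IS \rangle = \langle IS' \rangle$ is valid only if $IS$ and $IS'$ are syntactically equal.
For a finite mapping $\Gamma$ (called a type environment) from variables to sorts,
$\Gamma \models \sigma$ and $\Gamma \models \varphi$ are defined as usual: $\Gamma \models \sigma$ iff $\mathrm{dom}(\sigma) = \mathrm{dom}(\Gamma)$ and
$\sigma(x) : \Gamma(x)$ for any $x \in \mathrm{dom}(\sigma)$; $\Gamma \models \varphi$ iff $\sigma \models \varphi$ for any value assignment $\sigma$
such that $\Gamma \models \sigma$.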
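
The interpretation of terms and the validity of formulae under a value assignment can be sketched as two recursive functions. The Python code below is a simplified illustration: terms and formulae are encoded as tagged tuples, the value assignment is a dictionary, and the call predicate and quantifiers are left out. The names interp, holds, and conj are hypothetical.

```python
def interp(t, sigma):
    """Interpret a term under the value assignment sigma (a dict)."""
    kind = t[0]
    if kind == "var":
        return sigma[t[1]]
    if kind == "val":
        return t[1]
    if kind == "add":
        return interp(t[1], sigma) + interp(t[2], sigma)
    if kind == "pair":
        return (interp(t[1], sigma), interp(t[2], sigma))
    if kind == "cons":
        return [interp(t[1], sigma)] + interp(t[2], sigma)
    raise ValueError(f"unknown term: {kind}")

def holds(phi, sigma):
    """Decide whether a quantifier-free formula is valid under sigma."""
    kind = phi[0]
    if kind == "eq":
        return interp(phi[1], sigma) == interp(phi[2], sigma)
    if kind == "or":
        return holds(phi[1], sigma) or holds(phi[2], sigma)
    if kind == "not":
        return not holds(phi[1], sigma)
    raise ValueError(f"unknown formula: {kind}")

def conj(p, q):
    """Conjunction as a derived form: not (not p or not q)."""
    return ("not", ("or", ("not", p), ("not", q)))

sigma = {"x": 2, "y": 3}
sum_is_five = ("eq", ("add", ("var", "x"), ("var", "y")), ("val", 5))
print(holds(sum_is_five, sigma))
```

Defining conjunction through negation and disjunction mirrors the remark above that the remaining connectives can be treated as derived forms.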

The type system is equipped with subtyping whose judgment is of the form
$\Gamma \vdash \Phi_1 <: \Phi_2$, which means stack type $\Phi_1$ is a subtype of $\Phi_2$ under $\Gamma$. The type
judgment for instructions (resp. instruction sequences) is of the form $\Gamma \vdash \Phi_1\ I\ \Phi_2$
(resp. $\Gamma \vdash \Phi_1\ IS\ \Phi_2$), which means that, under $\Gamma$, if $I$ (resp. $IS$) is executed
under a stack satisfying $\Phi_1$, the resulting stack (if the execution terminates)
satisfies $\Phi_2$. We often call $\Phi_1$ pre-condition and $\Phi_2$ post-condition.

We show representative typing rules in Figure 3. 
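$$\frac{\Gamma, x : T \vdash \{\Upsilon \mid \varphi\}\; IS\; \{\Upsilon' \mid \varphi'\}}{\Gamma \vdash \{x : T \triangleright \Upsilon \mid \varphi\}\; \mathtt{DIP}\; IS\; \{x : T \triangleright \Upsilon' \mid \varphi'\}} \quad (\text{RT-Dip})$$

$$\frac{\Gamma \vdash \{\Upsilon \mid \exists x : \mathtt{int}.\, \varphi \land x \neq 0\}\; IS_1\; \Phi \qquad \Gamma \vdash \{\Upsilon \mid \exists x : \mathtt{int}.\, \varphi \land x = 0\}\; IS_2\; \Phi}{\Gamma \vdash \{x : \mathtt{int} \triangleright \Upsilon \mid \varphi\}\; \mathtt{IF}\; IS_1\; IS_2\; \Phi} \quad (\text{RT-If})$$

$$\frac{\Gamma \vdash \{\Upsilon \mid \exists x : \mathtt{int}.\, \varphi \land x \neq 0\}\; IS\; \{x : \mathtt{int} \triangleright \Upsilon \mid \varphi\}}{\Gamma \vdash \{x : \mathtt{int} \triangleright \Upsilon \mid \varphi\}\; \mathtt{LOOP}\; IS\; \{\Upsilon \mid \exists x : \mathtt{int}.\, \varphi \land x = 0\}} \quad (\text{RT-Loop})$$

$$\frac{y_1' : T_1 \vdash \{y_1 : T_1 \mid y_1' = y_1 \land \varphi_1\}\; IS\; \{y_2 : T_2 \mid \varphi_2\}}{\Gamma \vdash \{\Upsilon \mid \varphi\}\; \mathtt{LAMBDA}\; T_1\; T_2\; IS\; \{x : T_1 \to T_2 \triangleright \Upsilon \mid \varphi \land \forall y_1' : T_1, y_2 : T_2.\, \varphi_1[y_1 := y_1'] \land \mathrm{call}(x, y_1') = y_2 \Rightarrow \varphi_2\}} \quad (\text{RT-Lambda})$$

$$\Gamma \vdash \{x_1 : T_1 \triangleright x_2 : T_1 \to T_2 \triangleright \Upsilon \mid \varphi\}\; \mathtt{EXEC}\; \{x_3 : T_2 \triangleright \Upsilon \mid \exists x_1 : T_1, x_2 : T_1 \to T_2.\, \varphi \land \mathrm{call}(x_2, x_1) = x_3\} \quad (\text{RT-Exec})$$

$$\frac{\Gamma \vdash \Phi_1 <: \Phi_1' \qquad \Gamma \vdash \Phi_1'\; IS\; \Phi_2' \qquad \Gamma \vdash \Phi_2' <: \Phi_2}{\Gamma \vdash \Phi_1\; IS\; \Phi_2} \quad (\text{RT-Sub})$$

Fig. 3. Typing rules (excerpt)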


— (RT-Dip) means that DIP $IS$ is well typed if the body $IS$ is typed under the
stack type obtained by removing the top element. The popped value named
$x$ is moved to the type environment part so that it can be referred to in the
refinement predicate $\varphi$ in the pre-condition.

— (RT-If) means that the instruction is well typed if both branches have the
same post-condition; the pre-conditions of the branches are strengthened by
the assumptions that the top of the input stack is true ($x \neq 0$) and false
($x = 0$). The variable $x$ is existentially quantified because the top element
will be removed before the execution of either branch.

— (RT-Loop) is similar to the proof rule for while-loops in Hoare logic. The
formula $\varphi$ is a loop invariant. Since the body of LOOP is executed while the
stack top is nonzero, the pre-condition for the body $IS$ is strengthened by
$x \neq 0$, whereas the post-condition of LOOP $IS$ is strengthened by $x = 0$.

— (RT-Lambda) is for the instruction to push a first-class function onto the
operand stack. The premise of the rule means that the body $IS$ takes a
value (named $y_1$) of type $T_1$ that satisfies $\varphi_1$ and outputs a value (named
$y_2$) of type $T_2$ that satisfies $\varphi_2$ (if it terminates). The post-condition in the
conclusion expresses, by using call, that the function $x$ has the property
above. The extra variable $y_1'$ in the type environment of the premise is an
alias of $y_1$; being a variable declared in the type environment, $y_1'$ can appear
in both $\varphi_1$ and $\varphi_2$ (see footnote 6) and can describe the relationship between the
input and output of the function.

— (RT-Exec) adds $\mathrm{call}(x_2, x_1) = x_3$ to the post-condition, meaning that
a call to the function $x_2$ with $x_1$ as an argument yields $x_3$. It may
look simpler than expected; the crux here is that $\varphi$ is expected to imply
$\forall x_1 : T_1, x_3 : T_2.\, \varphi_1 \land \mathrm{call}(x_2, x_1) = x_3 \Rightarrow \varphi_2$, where $\varphi_1$ and $\varphi_2$ represent
the pre- and post-conditions, respectively, of function $x_2$. If $x_1$ satisfies $\varphi_1$,
then we can derive that $\varphi_2$ holds.


6 The scope of a variable in a refinement stack type is its predicate part and so $y_1$
cannot appear in the post-condition.

— (RT-Sub) is the subsumption rule, which strengthens the pre-condition
and weakens the post-condition. In our type system, subtyping is defined
semantically: a subtyping judgment $\Gamma \vdash \{\Upsilon \mid \varphi_1\} <: \{\Upsilon \mid \varphi_2\}$ holds if, for any
$\sigma$ such that $\forall x \in \mathrm{dom}(\Gamma, \Upsilon).\ \sigma(x) : (\Gamma, \Upsilon)(x)$, $\sigma \models \varphi_1 \Rightarrow \varphi_2$ is valid. (Here,
by abuse of notation, the type binding stack $\Upsilon$ is regarded as a mapping from
variables to sorts.)


We state that our type system is sound: For a well-typed instruction, if we
execute the instruction under a stack that satisfies the pre-condition of the typing,
then (if the execution halts) the resulting stack satisfies the post-condition of the
typing. To state the soundness theorem, we define an auxiliary relation $\Gamma \models S : \Phi$,
which means “stack $S$ satisfies stack refinement type $\Phi$ under environment $\Gamma$”,
by: $\Gamma \models V_1 \triangleright \dots \triangleright V_m : \{y_1 : T_1 \triangleright \dots \triangleright y_m : T_m \mid \varphi\}$ iff $V_1 : T_1, \dots, V_m : T_m$,
and $\sigma[y_1 \mapsto V_1, \dots, y_m \mapsto V_m] \models \varphi$ for any $\sigma$ such that $\Gamma \models \sigma$.

Then, the soundness theorem, whose proof will appear in a forthcoming full 
version, is stated as follows: 


Theorem 1 (Soundness). If $\Gamma \vdash \Phi_1\ IS\ \Phi_2$, $\Gamma \models S : \Phi_1$, and $S \vdash IS \Downarrow S'$,
then $\Gamma \models S' : \Phi_2$.
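
The soundness statement can be exercised on concrete stacks. The following Python sketch is not part of HELMHOLTZ: it is a hypothetical big-step evaluator for a handful of Mini-Michelson-like instructions, with booleans encoded as integers as in the rules above. The instruction encoding, the names run, BODY and TRIANGULAR, and the example program are all our own assumptions (the program is not the triangular_num.tz benchmark). It tests a pre-/post-condition pair on sample inputs; it does not prove it.

```python
# Hypothetical evaluator for a small Mini-Michelson-like fragment.
# A stack is a Python list whose first element is the stack top.

def run(code, stack):
    """Execute an instruction sequence and return the resulting stack."""
    for ins in code:
        op = ins[0]
        if op == "PUSH":
            stack = [ins[1]] + stack
        elif op == "DROP":
            stack = stack[1:]
        elif op == "DUP":
            stack = [stack[0]] + stack
        elif op == "SWAP":
            stack = [stack[1], stack[0]] + stack[2:]
        elif op == "ADD":
            stack = [stack[0] + stack[1]] + stack[2:]
        elif op == "DIP":    # run the body below the top element
            stack = [stack[0]] + run(ins[1], stack[1:])
        elif op == "IF":     # a nonzero top selects the first branch
            branch = ins[1] if stack[0] != 0 else ins[2]
            stack = run(branch, stack[1:])
        elif op == "LOOP":   # pop the top; repeat the body while it is nonzero
            while stack[0] != 0:
                stack = run(ins[1], stack[1:])
            stack = stack[1:]
        else:
            raise ValueError(f"unknown instruction {op}")
    return stack

# Sum 1 + ... + n for an input stack [n] with n >= 0.
# The loop body maps  n : acc  to  (n-1) : (n-1) : (acc+n).
BODY = [("DUP",), ("DIP", [("ADD",)]), ("PUSH", -1), ("ADD",), ("DUP",)]
TRIANGULAR = [("PUSH", 0), ("SWAP",), ("DUP",), ("LOOP", BODY)]

def post_holds(n, stack):
    """Post-condition { z : s | z = 0 and s = n(n+1)/2 }."""
    return stack == [0, n * (n + 1) // 2]

# Pre-condition { n | n >= 0 }: check the post-condition on sample inputs.
checked = all(post_holds(n, run(TRIANGULAR, [n])) for n in range(20))
print(checked)
```

The LOOP case mirrors the shape of (RT-Loop): the body runs only after a nonzero flag has been popped, and the code after the loop sees the stack with a zero flag removed.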


Sketch of Typechecking We implement a typechecking algorithm as follows. 
Given a type environment, a pre-condition, and a post-condition, our algorithm 
computes the strongest post-condition of the code starting from the given
pre-condition. This computation is conducted according to the syntax-directed version
of the typing rules created essentially in the same way as a type system with 
subtyping (e.g., one described in [15]). An application of the subtyping generates 
verification conditions. The accumulated verification conditions are fed to Z3; 
the typechecking succeeds if they are successfully discharged. 
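
As a rough illustration of what a verification condition looks like, the sketch below searches for counterexamples to an implication between two predicates over a small range of integers. This is a stand-in only: HELMHOLTZ hands such conditions to Z3, whereas a bounded search can refute an implication but never prove it. The function name find_counterexample and the example predicates are our own and do not come from the tool.

```python
from itertools import product

def find_counterexample(pre, post, arity, bound=5):
    """Return an assignment in [-bound, bound]^arity where pre holds and
    post fails, or None if no such assignment exists within the bound."""
    values = range(-bound, bound + 1)
    for assignment in product(values, repeat=arity):
        if pre(*assignment) and not post(*assignment):
            return assignment
    return None

# A condition of the kind produced by subsumption after a loop:
# "invariant and exit condition" should imply the wanted post-condition.
# Variables: x (loop flag), acc (accumulator), n (original input).
def inv_and_exit(x, acc, n):
    return acc + x * (x + 1) // 2 == n * (n + 1) // 2 and x == 0

def wanted_post(x, acc, n):
    return acc == n * (n + 1) // 2

def too_strong(x, acc, n):
    return acc == n

print(find_counterexample(inv_and_exit, wanted_post, 3))  # no counterexample in range
print(find_counterexample(inv_and_exit, too_strong, 3))   # some refuting assignment
```

A result of None only means that no counterexample exists within the chosen bound; an SMT solver is what turns such a check into a proof of validity.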


3.3 Extensions 


The implementation supports a few extensions of the formalization above,
which we describe below.

The type system implemented in HELMHOLTZ is extended with refinements for 
values thrown by raising exceptions. For example, the typing rule for instruction 
FAILWITH, which raises an exception with the value at the stack top, is given as 
follows: 


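$$\Gamma \vdash \{x : T \triangleright \Upsilon \mid \varphi\}\; \mathtt{FAILWITH}\; \{\Upsilon \mid \bot\}\ \&\ \{err \mid \exists x : T, \Upsilon.\, \varphi \land x = err\}.$$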


The rule expresses that, if FAILWITH is executed under a non-empty stack that
satisfies $\varphi$, then the program point just after the instruction is not reachable
(hence, $\{\Upsilon \mid \bot\}$). The refinement $\exists x : T, \Upsilon.\, \varphi \land x = err$ for the exception case
states that $\varphi$ in the pre-condition holds and that the top element $x$ is equal to the raised
value $err$; since $x$ is not in scope in the exception refinement, $x$ is bound
by an existential quantifier. The typing rules for the other instructions can be
extended with the “&” part easily.

HELMHOLTZ deals with measure functions introduced by Kawaguchi et al. [9]
and supported by Liquid Haskell [23]. If a measure function is defined by a
Measure annotation, HELMHOLTZ “weaves” the function definition into relevant
typing rules. For instance, given the annotation
Measure len : list int -> int where [] = 0 | h :: t = (1 + len t),
HELMHOLTZ assumes an uninterpreted function symbol len and augments
(RT-Nil) and (RT-Cons) as follows, where the last equality in each
post-condition comes from the definition of len.


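$$\Gamma \vdash \{\Upsilon \mid \varphi\}\; \mathtt{NIL}\; T\; \{x : T\ \mathtt{list} \triangleright \Upsilon \mid \varphi \land x = [\,] \land \mathrm{len}\ [\,] = 0\}$$

$$\Gamma \vdash \{x_1 : T \triangleright x_2 : T\ \mathtt{list} \triangleright \Upsilon \mid \varphi\}\; \mathtt{CONS}\; \{x_3 : T\ \mathtt{list} \triangleright \Upsilon \mid \exists x_1 : T, x_2 : T\ \mathtt{list}.\, \varphi \land x_1 :: x_2 = x_3 \land \mathrm{len}\,(x_1 :: x_2) = 1 + \mathrm{len}\ x_2\}$$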


4 Tool Implementation 


In this section, we discuss annotations in detail, show a case study of contract 
verification, and present verification experiments. 


4.1 Annotations 


HELMHOLTZ supports several forms of annotations (surrounded by << and >> in 
the source code), other than ContractAnnot explained in Section 2. 

Assert $\Phi$ and Assume $\Phi$ can appear before or after an instruction. The former
asserts that the stack at the annotated program location satisfies the type $\Phi$; the
assertion is verified by HELMHOLTZ. If there is an annotation Assume $\Phi$,
HELMHOLTZ assumes that the stack satisfies the type $\Phi$ at the annotated program
location. A user can give a hint to HELMHOLTZ by using Assume $\Phi$. The user
has to make sure that it is correct; if an Assume annotation is incorrect, the
verification result may be incorrect.

LoopInv $\Phi$ asserts the loop invariant of a loop instruction (e.g., LOOP and
ITER). In the current implementation, annotating a loop invariant using LoopInv
$\Phi$ is mandatory. HELMHOLTZ checks that $\Phi$ is indeed a loop invariant and uses it
to verify the rest of the program.

In the current implementation, a LAMBDA instruction, which pushes a function
on the top of the stack, must be accompanied by an annotation of the form
LambdaAnnot $\Phi_{pre}$ -> $\Phi_{post}$ & $\Phi_{abpost}$ ($x_1 : T_1, \dots, x_n : T_n$),
where $\Phi_{pre}$ -> $\Phi_{post}$ & $\Phi_{abpost}$ is a specification of the pushed function and the
bindings ($x_1 : T_1, \dots, x_n : T_n$) introduce the ghost variables that can be used in
the annotations in the body of the annotated LAMBDA instruction (see footnote 7); one can omit
the declaration of ghost variables if it is empty. The first contract in Figure 4,
which pushes a function that takes a pair of integers and returns the sum of them,
presents an example of LambdaAnnot. The annotated type of the function (Line 5)


7 ContractAnnot also allows declarations of ghost variables used in the code section.
 1 parameter unit;
 2 storage int;
 3 << ContractAnnot { _ | True } -> { _ | True } & { _ | False } >>
 4 code { DROP;
 5        << LambdaAnnot { p | p = (3, 1) } -> { x | x = 4 } & { _ | False }
 6           (a:int, b:int) >>
 7        LAMBDA (pair int int) int
 8          { << Assume { p | p = (a, b) } >>
 9            UNPAIR; ADD
10            << Assert { p | p = a + b } >>
11          };
12        PUSH int 1; PUSH int 3; PAIR; EXEC;
13        << Assert { x | x = 4 } >>
14        NIL operation; PAIR }


parameter (list int);
storage int;
<< Measure len : list int -> int where [] = 0 | h :: t = (1 + len t) >>
<< ContractAnnot
   { (p, _) | True } -> { (_, ret) | len p = ret } & { _ | False } >>
code { CAR; PUSH int 0; SWAP;
       << LoopInv { l : n | len l + n = len p } >>
       ITER { DROP; PUSH int 1; ADD };
       NIL operation;
       PAIR }


Fig. 4. lambda.tz, which uses higher-order functions, and length.tz, which uses a
measure function in the contract annotation.


expresses that it returns 4 if it is fed with a pair (3,1). The ghost variables a and 
b are used in the annotations Assume (Line 8) and Assert (Line 10) in the body 
to denote the first and the second arguments of the pair passed to this function. 

HELMHOLTZ allows user-defined (recursive) functions to be used in annotations;
these functions are called measure functions following the terminology of Liquid
Haskell [9]. The annotation Measure $x$ : $T_1$ -> $T_2$ where $p_1$ = $e_1$ | ... | $p_n$ = $e_n$
defines a recursive function $x$ that takes a value of type $T_1$, destructs it by
pattern matching, and returns a value of type $T_2$. Metavariables $p$ and $e$
represent ML-like patterns and expressions. The second contract in Figure 4,
which computes the length of the list passed as a parameter, exemplifies the
usage of the Measure annotation. This contract defines a measure function len
that takes a list of integers and returns its length; it is used in ContractAnnot and
LoopInv.
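
To make the role of the loop invariant in length.tz concrete, the following Python sketch replays the ITER loop on an ordinary list and checks the invariant len l + n = len p before every iteration and at exit. It models only the stack fragment l : n named in the LoopInv annotation; the names measure_len and run_length are ours. Running it on examples is a test, not the proof that HELMHOLTZ obtains from Z3.

```python
def measure_len(xs):
    """The measure from the annotation: [] = 0 | h :: t = 1 + len t."""
    return 0 if not xs else 1 + measure_len(xs[1:])

def run_length(p):
    """Mimic CAR; PUSH int 0; SWAP; ITER { DROP; PUSH int 1; ADD }."""
    l, n = list(p), 0                  # after SWAP the stack is  l : n
    while True:
        # LoopInv { l : n | len l + n = len p }
        assert measure_len(l) + n == measure_len(p)
        if not l:
            break
        _head, l = l[0], l[1:]         # ITER pushes the head; DROP discards it
        n = n + 1                      # PUSH int 1; ADD
    return n                           # at exit l = [], hence n = len p

print(run_length([5, 6, 7]))
```

At loop exit the remaining list is empty, so the invariant collapses to n = len p, which is the post-condition stated in the ContractAnnot of length.tz.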


4.2 Case Study: Contract with Signature Verification 


Figure 5 presents the code of the contract checksig.tz, which verifies that 
a sender indeed signed certain data using her private key. This contract uses 
instruction CHECK_SIGNATURE, which is supposed to be executed under a stack of 
the form $key \triangleright sig \triangleright bytes \triangleright tl$, where key is a public key, sig is a signature,
and bytes is some data. CHECK_SIGNATURE pops these three values from the 
 1 parameter (pair signature string);
 2 storage (pair address key);
 3 << ContractAnnot
 4   { (param, store) | match Contract store.first with
 5       Contract<string> _ -> True | _ -> False } ->
 6   { (ops, new_store) | store = new_store &&
 7       sig store.second param.first (Pack param.second) &&
 8       ops = [ Transfer param.second 1 (Contract store.first) ] }
 9   & { _ | not (sig store.second param.first (Pack param.second)) } >>
10 code { DUP; DUP; DUP;
11        DIP { CAR; UNPAIR; DIP { PACK } }; CDDR;
12        CHECK_SIGNATURE; ASSERT;
13
14        UNPAIR; CDR; SWAP; CAR;
15        CONTRACT string; ASSERT_SOME; SWAP;
16        PUSH mutez 1; SWAP;
17        TRANSFER_TOKENS;
18
19        NIL operation; SWAP;
20        CONS; DIP { CDR };
21        PAIR
22      }


Fig. 5. checksig.tz, which involves signature verification.


stack and pushes true if sig is the valid signature for bytes with the private 
key corresponding to key. 

The intended behavior of checksig.tz is as follows. It stores a pair of 
an address addr, which is the address of a contract that takes a string
parameter, and a public key key in its storage. It takes a pair (sig, s) of type
pair signature string as a parameter where signature is the primitive 
Michelson type for signatures. This contract terminates without exception if sig 
is created from the serialized (packed) representation of s and signed by the 
private key corresponding to key. In a normal termination, this contract transfers 
1 mutez to the contract with address addr. If this signature verification fails, 
then an exception is raised. 

This behavior is expressed as a specification in the ContractAnnot annotation 
in checksig.tz as follows. 


— The refinement of its pre-condition part expresses that the address stored
in the first element store.first of the storage store is an address of
a contract that takes a value of type string as a parameter. This is expressed
by the pattern-matching of Contract store.first, which represents
the contract stored at the address store.first, to the pattern expression
Contract<string> _, which matches a contract that takes a string value.
— The refinement of the post-condition forces the following three conditions:
(1) the store is not updated by this contract (store = new_store); (2)
param.first is the signature created from the packed string
Pack param.second of the string in the second element of the parameter and signed by the
private key corresponding to the second element store.second of the store
(sig store.second param.first (Pack param.second)); and (3) the list of
operations ops returned by this contract is
[ Transfer param.second 1 (Contract store.first) ], which represents an operation of transferring
1 mutez to the contract Contract store.first with the parameter
param.second. The predicate sig and the constructor Pack are primitives of
HELMHOLTZ that can be used in an annotation.

— The refinement in the exception part expresses that if an exception is raised, 
then the signature verification should have failed (not (sig store.second 
param.first (Pack param.second))). 


HELMHOLTZ successfully verifies checksig.tz without any additional annotation
in the code section. If we change the instruction ASSERT in Line 12 to
DROP to let the contract drop the result of the signature verification (hence, an
exception is not raised even if the signature verification fails), the verification
fails as intended.


4.3 Experiments 


We applied HELMHOLTZ to various contracts; Table 1 is an excerpt of the result,
in which we show (1) the number of instructions in each contract (column
#instr.) and (2) the time (ms) spent to verify each contract. The experiments were
conducted on macOS Catalina 10.15.7 with a Dual-Core Intel Core i5 (1.8 GHz) and 8
GB RAM. We used Z3 version 4.8.8. The contracts boomerang.tz, deposit.tz,
manager.tz, vote.tz, and reservoir.tz are taken from the benchmark of
Mi-Cho-Coq [3]. checksig.tz is derived from weather_insurance.tz of the official
Tezos test suite. vote_for_delegate.tz and xcat.tz are taken from the official
test suite; xcat.tz is simplified from the original. triangular_num.tz is a simple
test case that we made as an example of using LOOP. The source code of these
contracts can be found at the Web interface of HELMHOLTZ. Each contract is
supposed to work as follows.


— boomerang.tz: Transfers the received amount of money to the source account. 

— deposit.tz: Transfers money to the sender if the address of the sender is
identical to the one stored in the storage.

— manager.tz: Calls the passed function if the address of the caller matches 
the address stored in the storage. 

— vote.tz: Accepts a vote to a candidate if the voter transfers enough voting 
fee, and stores the tally. 

— checksig.tz: The one explained in Section 4.2. 

— vote_for_delegate.tz: Delegates one’s ballot in voting by stakeholders, 
which is one of the fundamental features of Tezos, to another using a primitive 
operation of Tezos. 

— xcat.tz: Transfers all stored money to one of the two accounts specified 
beforehand if called with the correct password. The account that gets money 
is decided based on whether the contract is called before or after a deadline. 


8 https://gitlab.com/tezos/tezos/-/tree/ee2f75bb941522acbcf6d5065a9f3b2/tests_python/contracts/mini_scenarios
— reservoir.tz: Sends a certain amount of money to one of two contracts,
depending on whether the contract is executed before or after the
deadline.

— triangular_num.tz: Calculates the sum from 1 to n, which is the passed 
parameter. 


In the experiments, we verified that each contract indeed works according to 
the intention explained above. triangular_num.tz was the only contract that 
required a manual annotation for verification in the code section; we needed to 
specify a loop invariant in this contract. 


Table 1. Benchmark result


Filename                  #instr.   time (ms)
boomerang.tz                   17          35
deposit.tz                     24          54
manager.tz                     29          60
vote.tz                        24          62
checksig.tz                    38          65
checksig_unverified.tz         36          62
vote_for_delegate.tz           87         143
xcat.tz                        64         188
reservoir.tz                   45          87
triangular_num.tz              16          35


Although the numbers of instructions in these contracts are not large, they
capture essential features of smart contracts: every contract except triangular_num.tz
executes transactions; deposit.tz and manager.tz check the identity of the
caller; and checksig.tz conducts signature verification. The time spent on
verification is small: under 200 ms for every contract in Table 1.


5 Related Work 


There are several publications on the formalization of programming languages for 
writing smart contracts. Hirai [7] formalizes EVM, a low-level smart contract
language of Ethereum and its implementation, using Lem [13], a language to specify
semantic definitions; definitions written in Lem can be compiled into definitions
in Coq, HOL4, and Isabelle/HOL. Based on the generated definition, he verifies
several properties of Ethereum smart contracts using Isabelle/HOL. Bernardo et
al. [3] implemented Mi-Cho-Coq, a formalization of the semantics of Michelson
using the Coq proof assistant. They also verified several Michelson contracts. 
Compared to their approach, we aim to develop an automated verification tool 
for smart contracts. Park et al. [14] developed a formal verification tool for EVM 
by using the K-framework [17], which can be used to derive a symbolic model 
checker from a formally specified language semantics (in this case, formalized 
EVM semantics [6]), and successfully applied the derived model checker to a few 
EVM contracts. It would be interesting to formalize the semantics of Michelson 
in the K-framework to compare HELMHOLTZ with the derived model checker. 
The DAO attack [18], mentioned in Section 1, is one of the notorious attacks
on a smart contract. It exploits a vulnerability of a smart contract that is related
to a callback. Grossman et al. [5] proposed a type-based technique to verify
that execution of a smart contract that may contain callbacks is equivalent to
another execution without any callback. This property, called effectively callback
freedom (ECF), can be seen as one of the criteria for execution of a smart contract not
to be vulnerable to DAO-like attacks. Their type system focuses on verifying
the ECF property of execution of a smart contract, whereas ours concerns the
verification of generic functional properties of a smart contract.

Benton proposes a program logic for a minimal stack-based programming
language [2]. His program logic can give an assertion to a stack as our stack
refinement types do. However, his language supports neither first-class functions
nor instructions for dealing with smart contracts (e.g., signature verification).

Our type system is an extension of the Michelson type system with
refinement types, which have been successfully applied to various programming
languages [16,22,9,10,20,26,23,24,25]. DTAL [25] is a notable example of an
application of refinement types to an assembly language, a low-level language like
Michelson. A DTAL program defines a computation using registers; we are not
aware of refinement types for stack-based languages like Michelson.

We notice the resemblance between our type system and a program logic for
PCF proposed by Honda and Yoshida [8], although the targets of verification are
different. Their logic supports a judgment of the form $A \vdash e :_u B$, where $e$ is a
PCF program, $A$ is a pre-condition assertion, $B$ is a post-condition assertion, and
$u$ represents the value that $e$ evaluates to and can be used in $B$, which resembles
our type judgment in the formalization in Section 3. Their assertion language also
incorporates a term expression $f \bullet x$, which expresses the value resulting from the
application of $f$ to $x$; this expression resembles the formula $\mathrm{call}(t_1, t_2) = t_3$ used
in a refinement predicate. We are not aware of an automated verifier implemented
based on their logic. Further comparison is interesting future work.


6 Conclusion 


We described our automated verification tool HELMHOLTZ for the smart contract 
language Michelson based on the refinement type system for Mini-Michelson. 
HELMHOLTZ verifies whether a Michelson program follows a specification given in 
the form of a refinement type. We also demonstrated that HELMHOLTZ successfully 
verifies various practical Michelson contracts. 

Currently, HELMHOLTZ supports approximately 80% of the instructions
of the Michelson language. The definition of a measure function is limited in the
sense that, for example, it can define only a function with one argument. We are
currently extending HELMHOLTZ so that it can deal with more programs.

HELMHOLTZ currently verifies the behavior of a single contract, although 
a blockchain application often consists of multiple contracts in which contract 
calls are chained. To verify such an application as a whole, we plan to extend 
HELMHOLTZ so that it can verify an inter-contract behavior compositionally by 
combining the verification results of each contract. 
Abstract. Deep Neural Networks (DNNs) are rapidly gaining popularity
in a variety of important domains. Formally, DNNs are complicated
vector-valued functions which come in a variety of sizes and applications.
Unfortunately, modern DNNs have been shown to be vulnerable
to a variety of attacks and buggy behavior. This has motivated recent
work in formally analyzing the properties of such DNNs. This paper
introduces SyReNN, a tool for understanding and analyzing a DNN by
computing its symbolic representation. The key insight is to decompose
the DNN into linear functions. Our tool is designed for analyses using
low-dimensional subsets of the input space, a unique design point in the
space of DNN analysis tools. We describe the tool and the underlying
theory, then evaluate its use and performance on three case studies:
computing Integrated Gradients, visualizing a DNN’s decision boundaries,
and patching a DNN.


Keywords: Deep Neural Networks · Symbolic representation · Integrated Gradients


1 Introduction 


Deep Neural Networks (DNNs) [18] have become the state-of-the-art in a variety
of applications including image recognition [53,33] and natural language
processing [12]. Moreover, they are increasingly used in safety- and security-critical
applications such as autonomous vehicles [31] and medical diagnosis [10,38,28,37].
These advances have been accelerated by improved hardware and algorithms.

DNNs (Section 2) are programs that compute a vector-valued function, i.e.,
from $\mathbb{R}^n$ to $\mathbb{R}^m$. They are straight-line programs written as a concatenation of
alternating linear and non-linear layers. The coefficients of the linear layers are
learned from data via gradient descent during a training process. A number
of different non-linear layers (called activation functions) are commonly used,
including the rectified linear and maximum pooling functions.

Owing to the variety of application domains and deployment constraints,
DNNs come in many different sizes. For instance, large image-recognition and
natural-language processing models are trained and deployed using cloud
resources [33,12], medium-size models could be trained in the cloud but deployed
on hardware with limited resources [31], and finally small models could be trained
and deployed directly on edge devices [47,9,22,34,35]. There has also been a
recent push to compress trained models to reduce their size [24]. Such smaller
models play an especially important role in privacy-critical applications, such
as wake word detection for voice assistants, because they allow sensitive user
data to stay on the user’s own device instead of needing to be sent to a remote
computer for processing.

Although DNNs are very popular, they are not perfect. One particularly
concerning development is that modern DNNs have been shown to be extremely
vulnerable to adversarial examples, inputs which are intentionally manipulated to
appear unmodified to humans but become misclassified by the DNN [54,19,40,8].
Similarly, fooling examples are inputs that look like random noise to humans, but
are classified with high confidence by DNNs [41]. Mistakes made by DNNs have
led to loss of life [36,17] and wrongful arrests [26,27]. For this reason, it is
important to develop techniques for analyzing, understanding, and repairing DNNs.

This paper introduces SyReNN, a tool for understanding and analyzing
DNNs. SyReNN implements state-of-the-art algorithms for computing precise
symbolic representations of piecewise-linear DNNs (Section 3). Given an input
subspace of a DNN, SyReNN computes a symbolic representation that
decomposes the behavior of the DNN into finitely-many linear functions. SyReNN
implements the one-dimensional analysis algorithm of Sotoudeh and Thakur [50]
and extends it to the two-dimensional setting as described in Section 4.


Key insights. There are two key insights enabling this approach, first
identified in Sotoudeh and Thakur [50]. First, most popular DNN architectures today
are piecewise-linear, meaning they can be precisely decomposed into
finitely-many linear functions. This allows us to reduce their analysis to equivalent
questions in linear algebra, one of the most well-understood fields of modern
mathematics. Second, many applications only require analyzing the behavior
of the DNN on a low-dimensional subset of the input space. Hence, whereas
prior work has attempted to give up precision for efficiency in analyzing
high-dimensional input regions [48,49,16], our work has focused on algorithms that
are both efficient and precise in analyzing lower-dimensional regions (Section 4).


Tool design. The SyReNN tool is designed to be easy to use and extend, as 
well as efficient (Section 5). The core of SyReNN is written as a highly-optimized, 
parallel C++ server using Intel TBB for parallelization [45] and Eigen for matrix 
operations [23]. A user-friendly Python front-end interfaces with the PyTorch 
deep learning framework [44]. 


Use cases. We demonstrate the utility of SyReNN using three applications.
The first computes Integrated Gradients (IG), a state-of-the-art measure used to
determine which input dimensions (e.g., pixels for an image-recognition network)
were most important in the final classification produced by the network
(Section 6.1). The second precisely visualizes the decision boundaries of a DNN
(Section 6.2). The last patches (repairs) a DNN to satisfy some desired specification
involving infinitely-many points (Section 6.3). Thus, SyReNN is an interesting
and useful tool in the toolbox for understanding and analyzing DNNs.


Contributions. The contributions of this paper are: 


— A definition of symbolic representation of DNNs (Section 3). 

— An efficient algorithm for computing symbolic representations for DNNs over 
low-dimensional input subspaces (Section 4). 

— A design of a usable and well-engineered tool implementing these ideas called 
SyReNN (Section 5). 

— Three applications of SyReNN (Section 6). 


Section 2 presents preliminaries about DNNs; Section 7 presents related work;
Section 8 concludes. SyReNN is available on GitHub at
https://github.com/95616ARG/SyReNN.


2 Preliminaries 


We now formally define the notion of DNN we will use in this paper. 


Definition 1. A Deep Neural Network (DNN) is a function $f : \mathbb{R}^n \to \mathbb{R}^m$
which can be written $f = f_1 \circ f_2 \circ \cdots \circ f_n$ for a sequence of layer functions
$f_1, f_2, \ldots, f_n$.


Our work is primarily concerned with the popular class of piecewise-linear 
DNNs, defined below. In this definition and the rest of this paper, we will use the 
term “polytope” to mean a convex and bounded polytope except where specified. 


Definition 2. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is piecewise-linear (PWL) if its input
domain $\mathbb{R}^n$ can be partitioned into finitely-many possibly-unbounded polytopes
$X_1, X_2, \ldots, X_k$ such that $f_{\restriction X_i}$ is linear for every $X_i$.


The most common activation function used today is the ReLU function, a 
PWL activation function which is defined below. 


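Definition 3. The rectified linear function (ReLU) is a function
$\mathrm{ReLU} : \mathbb{R}^n \to \mathbb{R}^n$ defined component-wise by

$$\mathrm{ReLU}(\vec{v})_i = \begin{cases} 0 & \text{if } v_i < 0 \\ v_i & \text{otherwise,} \end{cases}$$

where $\mathrm{ReLU}(\vec{v})_i$ is the $i$th component of the vector $\mathrm{ReLU}(\vec{v})$ and $v_i$ is the $i$th
component of the vector $\vec{v}$.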


In order to see that ReLU is PWL, we must show that its input domain $\mathbb{R}^n$
can be partitioned such that, in each partition, ReLU is linear. In this case, we
can use the orthants of $\mathbb{R}^n$ as our partitioning: within each orthant, the signs
of the components do not change, hence ReLU is the linear function that just
zeros out the negative components.
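
The orthant argument can be checked numerically. In the sketch below (plain Python, with helper names of our own choosing), an orthant is identified by a sign pattern, and ReLU restricted to that orthant coincides with multiplication by a fixed 0/1 diagonal matrix. This is only an illustration on a few sample points, not a proof.

```python
def relu(v):
    """Component-wise rectified linear function."""
    return [x if x > 0 else 0.0 for x in v]

def orthant_matrix(signs):
    """Diagonal 0/1 matrix that agrees with ReLU on the orthant `signs`."""
    n = len(signs)
    return [[1.0 if (i == j and signs[i] > 0) else 0.0 for j in range(n)]
            for i in range(n)]

def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

signs = [+1, -1, +1]                  # orthant: v0 >= 0, v1 <= 0, v2 >= 0
D = orthant_matrix(signs)

inside = [[2.0, -3.0, 0.5], [0.0, -1.0, 4.0], [1.5, 0.0, 0.0]]
agrees = all(relu(v) == matvec(D, v) for v in inside)

outside = [-1.0, 2.0, 1.0]            # lies in a different orthant
differs = relu(outside) != matvec(D, outside)
print(agrees, differs)
```

A point from another orthant needs a different diagonal matrix, which is why ReLU is piecewise-linear rather than linear on all of the input space.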
Fig. 1: Example function for which $\widehat{f}_{\restriction [-1,2]} = \{[-1, 0], [0, 1], [1, 2]\}$.


Although we focus on ReLU due to its popularity and expository power,
SyReNN works with a number of other popular PWL layers, including MaxPool,
Leaky ReLU, Hard Tanh, Fully-Connected, and Convolutional layers, as defined
in [18]. PWL layers have become exceedingly common. In fact, nearly all of the
state-of-the-art image recognition models bundled with PyTorch [43] are PWL.


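Example 1. The DNN $f : \mathbb{R}^1 \to \mathbb{R}^1$ defined by

$$f(x) = \begin{bmatrix} 1 & -1 & -1 \end{bmatrix} \mathrm{ReLU}\left( \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} \right)$$

can be broken into layers $f = f_1 \circ f_2 \circ f_3$ where

$$f_1(x) = \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix}, \quad f_2 = \mathrm{ReLU}, \quad \text{and} \quad f_3(v) = \begin{bmatrix} 1 & -1 & -1 \end{bmatrix} v.$$

The DNN’s input-output behavior on the domain $[-1, 2]$ is shown in Figure 1.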


3 A Symbolic Representation of DNNs 


We formalize the symbolic representation according to the following definition: 


Definition 4. Given a PWL function $f : \mathbb{R}^n \to \mathbb{R}^m$ and a bounded convex
polytope $X \subseteq \mathbb{R}^n$, we define the symbolic representation of $f$ on $X$, written $\widehat{f}_{\restriction X}$,
to be a finite set of polytopes $\widehat{f}_{\restriction X} = \{P_1, \ldots, P_n\}$, such that:


1. The set $\{P_1, P_2, \ldots, P_n\}$ partitions $X$, except possibly for overlapping
boundaries.

2. Each $P_i$ is a bounded convex polytope.

3. Within each $P_i$, the function $f_{\restriction P_i}$ is linear.


Notably, if $f$ is a DNN using only PWL layers, then $f$ is PWL and so we can
define $\widehat{f}_{\restriction X}$. This symbolic representation allows one to reduce questions about
the DNN $f$ to questions about finitely-many linear functions $F_i$. For example,
because linear functions are convex, to verify that $\forall x \in X.\ f(x) \in Y$ for some
polytope $Y$, it suffices to verify $\forall P_i \in \widehat{f}_{\restriction X}.\ \forall v \in \mathrm{Vert}(P_i).\ f(v) \in Y$, where
$\mathrm{Vert}(P_i)$ is the (finite) set of vertices for the bounded convex polytope $P_i$; thus,
here both of the quantifiers are over finite sets. The symbolic representation
described above can be seen as a generalization of the ExactLine representation [50],
which considered only one-dimensional restriction domains of interest.


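Example 2. Consider again the DNN $f : \mathbb{R}^1 \to \mathbb{R}^1$ given by

\[
f(x) := \begin{bmatrix} 1 & -1 & -1 \end{bmatrix}
\mathrm{ReLU}\left(
\begin{bmatrix} 1 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix}
\begin{bmatrix} x \\ 1 \end{bmatrix}
\right)
\]

and the region of interest $X = [-1, 2]$. The input-output behavior of f on X is
shown in Figure 1. From this, we can see that

\[
\widehat{f}_{\restriction X} = \{[-1, 0], [0, 1], [1, 2]\}.
\]

Within each of these partitions, the input-output behavior is linear, which for
$\mathbb{R}^1 \to \mathbb{R}^1$ we can see visually as just a line segment. As this set fully
partitions X, this is a valid $\widehat{f}_{\restriction X}$.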
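
The example can be checked numerically. The following is a minimal sketch, assuming NumPy and the weights of Example 1 with the bias folded in through the vector [x, 1]; the helper names `is_linear_on` and `verified_on` are illustrative and not part of SyReNN. It tests the midpoint condition (necessary for linearity) on each partition and then performs the vertex-only check described after Definition 4 for an interval Y.

```python
import numpy as np

# Weights of the network from Example 1; the bias is folded in via [x, 1].
W1 = np.array([[1.0, -1.0], [1.0, 0.0], [-1.0, 0.0]])
W3 = np.array([1.0, -1.0, -1.0])


def f(x):
    """f(x) = [1 -1 -1] ReLU(W1 [x, 1]^T) for a scalar input x."""
    hidden = np.maximum(W1 @ np.array([x, 1.0]), 0.0)
    return float(W3 @ hidden)


def is_linear_on(segment):
    """Midpoint test on [a, b]: a necessary condition for linearity there."""
    a, b = segment
    return bool(np.isclose(f(0.5 * (a + b)), 0.5 * (f(a) + f(b))))


def verified_on(partition, y_lo, y_hi):
    """Check f(v) in [y_lo, y_hi] at every vertex of every polytope."""
    return all(y_lo <= f(v) <= y_hi for seg in partition for v in seg)


partition = [(-1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
print([is_linear_on(seg) for seg in partition])  # each partition passes
print(is_linear_on((-1.0, 2.0)))                 # the unsplit domain does not
print(verified_on(partition, -1.0, 0.0))
print(verified_on(partition, -0.5, 0.0))
```

Because f is linear on each partition, checking the four distinct endpoints is enough to decide whether f maps all of X into the interval.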


4 Computing the Symbolic Representation 


This section presents an efficient algorithm for computing $\widehat{f}_{\restriction X}$ for a
DNN f composed of PWL layers. To retain both scalability and precision, we will
require the input region X be two-dimensional. This design choice is relatively
unexplored in the neural-network analysis literature (most analyses strike a
balance between precision and scalability, ignoring dimensionality). We show
that, for two-dimensional X, we can use an efficient polytope representation to
produce an algorithm that demonstrates good best-case and in-practice
efficiency while retaining full precision. This algorithm represents a direct
generalization of the approach of [50].

The difficulties our algorithm addresses arise from three areas. First, when
computing $\widehat{f}_{\restriction X}$ there may be exponentially many such partitions on all
of $\mathbb{R}^n$ but only a small number of them may intersect with X. Consequently, the
algorithm needs to be able to find those partitions that intersect with X
efficiently without explicitly listing all of the partitions on $\mathbb{R}^n$. Second, it
is often more convenient to specify the partitioning via hyperplanes separating
the partitions than explicit polytopes. For example, for the one-dimensional
ReLU function we may simply state that the line x = 0 separates the two
partitions, because ReLU is linear both in the region x < 0 and x > 0. Finally,
neural networks are typically composed of sequences of linear and
piecewise-linear layers, where the partitioning imposed by each layer
individually may be well-understood but their composition is more complex. For
example, identifying the linear partitions of
$y = \mathrm{ReLU}(4 \cdot \mathrm{ReLU}(-3x - 1) + 2)$ is non-trivial, even though we know the
linear partitions of each composed function individually.

Our algorithm only requires the user to specify the hyperplanes defining the
partitioning for the activation function used in each layer; our current
implementation comes with support for common PWL activation functions. For
example, if a ReLU layer is used for an n-dimensional input vector, then the
hyperplanes would be defined by the equations
$x_1 = 0, x_2 = 0, \ldots, x_n = 0$. It then computes the symbolic representation for a
single layer at a time, composing them sequentially to compute the symbolic
representation across the entire network.

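To allow such compositions of layers, instead of directly computing
$\widehat{f}_{\restriction X}$, we will define another primitive, denoted by the operator
$\otimes$ and sometimes referred to as Extend, such that

\[
\mathrm{Extend}(h, \widehat{g}_{\restriction X}) = h \otimes \widehat{g}_{\restriction X}
= \widehat{(h \circ g)}_{\restriction X}. \qquad (1)
\]

Consider $f = f_n \circ f_{n-1} \circ \cdots \circ f_1$, and let $I : x \mapsto x$ be the identity
map. I is linear across its entire input space, and, thus,
$\widehat{I}_{\restriction X} = \{X\}$. By the definition of $\mathrm{Extend}(f_1, \cdot)$, we have
$f_1 \otimes \widehat{I}_{\restriction X} = \widehat{(f_1 \circ I)}_{\restriction X} = \widehat{f_1}_{\restriction X}$,
where the final equality holds by the definition of the identity map I. We can
then iteratively apply this procedure to inductively compute
$\widehat{(f_i \circ \cdots \circ f_1)}_{\restriction X}$ from
$\widehat{(f_{i-1} \circ \cdots \circ f_1)}_{\restriction X}$ like so:

\[
f_i \otimes \widehat{(f_{i-1} \circ \cdots \circ f_1)}_{\restriction X}
= \widehat{(f_i \circ f_{i-1} \circ \cdots \circ f_1)}_{\restriction X}
\]

until we have computed
$\widehat{(f_n \circ f_{n-1} \circ \cdots \circ f_1)}_{\restriction X} = \widehat{f}_{\restriction X}$,
which is the required symbolic representation.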
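
The layer-by-layer scheme can be illustrated in the simpler one-dimensional setting, where X is an interval and each polytope is a segment. The sketch below is an assumption-laden illustration, not SyReNN's implementation: the layer encoding, `apply_layers`, `extend_relu`, and `symbolic_representation` are hypothetical names, and only affine and ReLU layers are handled. It starts from the identity's representation {X} and refines it once per ReLU layer, using the fact that the layers processed so far are linear on every current segment.

```python
import numpy as np

RELU = ("relu",)


def affine(W, b):
    """Encode an affine layer v -> W v + b (hypothetical encoding)."""
    return ("affine", np.asarray(W, dtype=float), np.asarray(b, dtype=float))


def apply_layers(layers, x):
    """Evaluate the composition of `layers` at the scalar input x."""
    v = np.asarray([x], dtype=float)
    for layer in layers:
        v = layer[1] @ v + layer[2] if layer[0] == "affine" else np.maximum(v, 0.0)
    return v


def extend_relu(breakpoints, prefix):
    """1-D analogue of Extend(ReLU, .): refine a sorted list of breakpoints.

    `prefix` is linear between consecutive breakpoints, so the zero crossing
    of each pre-activation coordinate follows from linear interpolation.
    """
    refined = [breakpoints[0]]
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        v_lo, v_hi = apply_layers(prefix, lo), apply_layers(prefix, hi)
        crossings = set()
        for a, b in zip(v_lo, v_hi):
            if a * b < 0:  # strict sign change: the plane x_k = 0 is crossed
                crossings.add(float(lo + (a / (a - b)) * (hi - lo)))
        refined.extend(sorted(crossings))
        refined.append(hi)
    return refined


def symbolic_representation(layers, lo, hi):
    breakpoints = [lo, hi]  # the identity map is linear on all of X
    for i, layer in enumerate(layers):
        if layer[0] == "relu":  # affine layers add no new boundaries
            breakpoints = extend_relu(breakpoints, layers[:i])
    return list(zip(breakpoints, breakpoints[1:]))


# The network of Example 1 on X = [-1, 2].
layers = [affine([[1.0], [1.0], [-1.0]], [-1.0, 0.0, 0.0]), RELU,
          affine([[1.0, -1.0, -1.0]], [0.0])]
segments = symbolic_representation(layers, -1.0, 2.0)
print(segments)
```

Only the endpoints of existing segments are ever evaluated, so the work is driven by the partitions that actually intersect X rather than by all sign patterns of the network.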


4.1 Algorithm for Extend 


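Algorithm 1 presents an algorithm for computing Extend for arbitrary PWL
functions, where
$\mathrm{Extend}(h, \widehat{g}_{\restriction X}) = h \otimes \widehat{g}_{\restriction X} = \widehat{(h \circ g)}_{\restriction X}$.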

Geometric intuition for the algorithm. Consider the ReLU function
(Definition 3). It can be shown that, within any orthant (i.e., when the signs
of all coefficients are held constant), $\mathrm{ReLU}(x)$ is equivalent to some linear
function, in particular the element-wise product of x with a vector that zeroes
out the negative-signed components. However, for our algorithm, all we need to
know is that the linear partitions of ReLU (in this case the orthants) are
separated by hyperplanes $x_1 = 0, x_2 = 0, \ldots, x_n = 0$.

Given a two-dimensional convex bounded polytope X, the execution of the
algorithm for f = ReLU can be visualized as follows. We pick some vertex v
of X, and begin traversing the boundary of the polytope in counter-clockwise
order. If we hit an orthant boundary (corresponding to some hyperplane
$x_i = 0$), it implies that the function behaves differently at the points of the
polytope to one side of the boundary from those at the other side of the
boundary. Thus, we partition X into $X_1$ and $X_2$, where $X_1$ lies to one side of
the hyperplane and $X_2$ lies to the other side. We recursively apply this
procedure to $X_1$ and $X_2$ until the resulting polytopes all lie on exactly one
side of every hyperplane (orthant boundary). But lying on exactly one side of
every hyperplane (orthant boundary) implies each polytope lies entirely within
a linear partition of the function (a single orthant), hence the application of
the function on that polytope is linear, and hence we have our partitioning.


Functions used in algorithm. Given a two-dimensional bounded convex
polytope X, Vert(X) returns a list of its vertices in counter-clockwise order,
repeating the initial vertex at the end. Given a set of points X, ConvexHull(X)
represents their convex hull (the smallest bounded polytope containing every
point in X). Given a scalar value x, Sign(x) computes the sign of that value
(i.e., −1 if x < 0, +1 if x > 0, and 0 if x = 0).

Algorithm description. The key insight of the algorithm is to recursively
partition the polytopes until each resulting polytope lies entirely within a
linear region of the function f. Algorithm 1 begins by constructing a queue
containing the polytopes of $\widehat{g}_{\restriction X}$. Each iteration either removes a
polytope from the queue that lies entirely in one linear region (placing it in
Y), or splits (partitions) some polytope into two smaller polytopes that get
put back into the queue. When we pop a polytope P from the queue, Line 6
iterates over all hyperplanes $N_k \cdot x = b_k$ defining the piecewise-linear
partitioning of f, looking for any for which some vertex $V_i$ lies on the
positive side of the hyperplane and another vertex $V_j$ lies on the negative
side of the hyperplane. If none exist (Line 7), by convexity we are guaranteed
that the entire polytope lies on one side with respect to every hyperplane,
meaning it lies entirely within a linear partition of f. Thus, we can add it to
Y and continue. If two such vertices are found (starting Line 10), then we can
find “extreme” indices i and j such that $V_i$ is the last vertex in a
counter-clockwise traversal to lie on the same side of the hyperplane as $V_1$
and $V_j$ is the last vertex lying on the opposite side of the hyperplane. We
then call SplitPlane() (Algorithm 2) to actually partition the polytope on
opposite sides of the hyperplane, adding both to our worklist.

In the best case, each partition is in a single orthant: the algorithm never calls 
SplitPlane() at all — it merely iterates over all of the n input partitions, checks 
their v vertices, and appends to the resulting set (for a best-case complexity of 
O(nv)). In the worst case, it splits each polytope in the queue on each face, 
resulting in exponential time complexity. As we will show in Section 6, this 
exponential worst-case behavior is not encountered in practice, thus making 
SyReNN a practical tool for DNN analysis. 

Please see the extended version of this paper for a worked example of the 
algorithm’s execution. 
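
The worklist structure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than the tool's C++ code: polygons are lists of NumPy vertices in counter-clockwise order, g is assumed affine on every input polygon, the split walks all edges instead of locating the extreme indices i and j, and the tolerance `EPS` and all function names are hypothetical.

```python
from collections import deque
import numpy as np

EPS = 1e-9  # signed values smaller than this count as lying on the plane


def signed_side(verts, g, N, b):
    """Signed value N . g(v) - b per vertex, snapped to 0 near the plane."""
    vals = [float(N @ g(v) - b) for v in verts]
    return [0.0 if abs(s) < EPS else s for s in vals]


def split_plane(verts, g, N, b):
    """Split a counter-clockwise polygon along N . g(x) = b.

    Since g is affine on the polygon, the crossing point on an edge follows
    from linear interpolation of the signed values at its two endpoints.
    """
    signed = signed_side(verts, g, N, b)
    side_a, side_b = [], []
    for k, v in enumerate(verts):
        nxt = (k + 1) % len(verts)
        s, s_next = signed[k], signed[nxt]
        if s >= 0:
            side_a.append(v)
        if s <= 0:
            side_b.append(v)
        if s * s_next < 0:  # this edge strictly crosses the hyperplane
            p = v + (s / (s - s_next)) * (verts[nxt] - v)
            side_a.append(p)
            side_b.append(p)
    return side_a, side_b


def extend(hyperplanes, g, g_partition):
    """Refine until each polygon lies on one side of every hyperplane."""
    work, done = deque(g_partition), []
    while work:
        verts = work.popleft()
        for N, b in hyperplanes:
            signed = signed_side(verts, g, N, b)
            if max(signed) > 0 and min(signed) < 0:
                work.extend(split_plane(verts, g, N, b))
                break
        else:  # no hyperplane separates the vertices
            done.append(verts)
    return done


def area(verts):
    """Shoelace formula for a counter-clockwise polygon."""
    x, y = np.array(verts).T
    return 0.5 * float(np.sum(x * np.roll(y, -1) - y * np.roll(x, -1)))


# g lifts the 2-D input to a 3-D pre-activation space; f is ReLU on R^3.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
g = lambda v: W @ v
relu_planes = [(axis, 0.0) for axis in np.eye(3)]
square = [np.array(p) for p in [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]]
regions = extend(relu_planes, g, [square])
print(len(regions), sum(area(r) for r in regions))
```

In this usage example the three ReLU boundaries pull back to the lines x = 0, y = 0, and x + y = 0 in the input plane, so the square is cut into sectors around the origin whose areas sum to the area of the square.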


4.2 Representing Polytopes 


We close this section with a discussion of implementation concerns when
representing the convex polytopes that make up the partitioning of
$\widehat{f}_{\restriction X}$. In standard computational geometry, bounded polytopes can be
represented in two equivalent forms:

1. The half-space or H-representation, which encodes the polytope as an
   intersection of finitely-many half-spaces. (Each half-space is defined by
   an affine inequality $Ax \le b$.)

2. The vertex or V-representation, which encodes the polytope as a set of
   finitely many points; the polytope is then taken to be the convex hull of
   the points (i.e., smallest convex shape containing all of the points).

Algorithm 1: $f \otimes \widehat{g}_{\restriction X}$ for two-dimensional X. f is defined by
hyperplanes $N_1 \cdot x = b_1$ through $N_m \cdot x = b_m$ such that, within any partition
imposed by the hyperplanes, f is equivalent to some affine function.
Input: $\widehat{g}_{\restriction X} = \{P_1, \ldots, P_n\}$.
Output: $\widehat{(f \circ g)}_{\restriction X}$.

 1  $W \leftarrow \mathrm{ConstructQueue}(\widehat{g}_{\restriction X})$
 2  $Y \leftarrow \emptyset$
 3  while W not empty do
 4      $P \leftarrow \mathrm{Pop}(W)$
 5      $V \leftarrow \mathrm{Vert}(P)$
 6      $K \leftarrow \{(N_k, b_k) \mid \exists i, j : \mathrm{Sign}(N_k \cdot g(V_i) - b_k) > 0 \wedge \mathrm{Sign}(N_k \cdot g(V_j) - b_k) < 0\}$
 7      if $K = \emptyset$ then
 8          $Y \leftarrow Y \cup \{P\}$
 9          continue
10      $N, b \leftarrow$ any element from K
11      $i \leftarrow \mathrm{argmax}_k \{\mathrm{Sign}(N \cdot g(V_k) - b) = \mathrm{Sign}(N \cdot g(V_1) - b)\}$
12      $j \leftarrow \mathrm{argmax}_k \{\mathrm{Sign}(N \cdot g(V_k) - b) \ne \mathrm{Sign}(N \cdot g(V_1) - b)\}$
13      for $V' \in \mathrm{SplitPlane}(V, g, i, j, N, b)$ do
14          $W \leftarrow \mathrm{Push}(W, \mathrm{ConvexHull}(V'))$
15  return Y


Certain operations are more efficient when using one representation compared 
to the other. For example, finding the intersection of two polytopes in an H- 
representation can be done in linear time by concatenating their representative 
half-spaces, but the same is not possible in V-representation. 
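
As a small illustration of the H-representation point, the sketch below (NumPy assumed; `box`, `intersect_h`, and `contains` are hypothetical helper names) stores a polytope as a pair (A, b) meaning {x : A x <= b} and intersects two polytopes by stacking their inequalities.

```python
import numpy as np


def box(lo, hi):
    """H-representation (A, b) of the axis-aligned square [lo, hi]^2."""
    A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
    b = np.array([hi, -lo, hi, -lo])
    return A, b


def intersect_h(poly_a, poly_b):
    """Intersection in H-representation: concatenate the half-spaces."""
    (A1, b1), (A2, b2) = poly_a, poly_b
    return np.vstack([A1, A2]), np.concatenate([b1, b2])


def contains(poly, point, tol=1e-12):
    """Membership test: every inequality A x <= b must hold."""
    A, b = poly
    return bool(np.all(A @ np.asarray(point, dtype=float) <= b + tol))


both = intersect_h(box(0.0, 2.0), box(1.0, 3.0))
print(both[0].shape, contains(both, [1.5, 1.5]), contains(both, [0.5, 0.5]))
```

The concatenation may contain redundant inequalities; removing them is a separate step that this sketch does not attempt.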

There are two main operations on polytopes we need to perform in our
algorithms: (i) splitting a polytope with a hyperplane, and (ii) applying an
affine map to all points in the polytope. In general, the first is more
efficient in an H-representation, while the latter is more efficient in a
V-representation. However, when restricted to two-dimensional polygons, the
former is also efficient in a V-representation, as demonstrated by
Algorithm 2, helping to motivate our use of the V-representation in our
algorithm.

Algorithm 2: SplitPlane(V, g, i, j, N, b)

Input: V, the vertices of the polytope in the input space of g. A function g.
i is the index of the last vertex lying on the same side of the orthant face
as $V_1$. j is the index of the last vertex lying on the opposite side of the
orthant face as $V_1$. N and b define the hyperplane $N \cdot x = b$ to split on.
Output: $\{P_1, P_2\}$, two sets of vertices whose convex hulls form a partitioning
of V such that each lies on only one side of the $N \cdot x = b$ hyperplane.

1  $p_i \leftarrow V_i + \frac{b - N \cdot g(V_i)}{N \cdot (g(V_{i+1}) - g(V_i))} (V_{i+1} - V_i)$
2  $p_j \leftarrow V_j + \frac{b - N \cdot g(V_j)}{N \cdot (g(V_{j+1}) - g(V_j))} (V_{j+1} - V_j)$
3  $A \leftarrow \{p_i, p_j\} \cup \{v \in V \mid \mathrm{Sign}(N \cdot g(v) - b) = \mathrm{Sign}(N \cdot g(V_i) - b)\}$
4  $B \leftarrow \{p_i, p_j\} \cup \{v \in V \mid \mathrm{Sign}(N \cdot g(v) - b) = \mathrm{Sign}(N \cdot g(V_j) - b)\}$
5  return $\{A, B\}$

Furthermore, the two polytope representations have different resiliency to
floating-point operations. In particular, it is notoriously difficult to
achieve high precision with H-representations for polytopes in $\mathbb{R}^n$, because
the error introduced from using floating-point numbers gets arbitrarily large
as one goes in a particular direction along any hyperplane face. Ideally, we
would like the hyperplane to be most accurate in the region of the polytope
itself, which corresponds to choosing the magnitude of the normal vector
correctly. Unfortunately, to our knowledge, there is no efficient algorithm for
computing the ideal floating-point H-representation of a polytope, although
libraries such as APRON [30] are able to provide reasonable results for
low-dimensional spaces. However, because neural networks utilize extremely
high-dimensional spaces (often hundreds or thousands of dimensions) and we wish
to iteratively apply our analysis, we find that errors from using
floating-point H-representations can quickly multiply and compound until the
analysis becomes infeasible. By contrast, floating-point inaccuracies in a
V-representation are directly interpretable as slightly misplacing the vertices
of the polytope; no “localization” process is necessary to penalize
inaccuracies close to the polytope more than those far away from it.

Another difference is in the space complexity of the representation. In gen- 
eral, H-representations can be more space-efficient for common shapes than V- 
representations. However, when the polytope lies in a low-dimensional subspace 
of a larger space, the V-representation is usually significantly more efficient. 

Thus, V-representations are a good choice for low-dimensionality polytopes 
embedded in high-dimensional space, which is exactly what we need for analyzing 
neural networks with two-dimensional restriction domains of interest. This is why 
we designed our algorithms to rely on Vert(X), so that they could be directly 
computed on a V-representation. 
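
A short sketch may make the space argument concrete. Assuming NumPy and an arbitrary randomly generated affine map (the map, its dimensions, and the name `apply_affine` are illustrative only), a two-dimensional triangle pushed into a 1000-dimensional space still needs just three vertices in V-representation, and the map commutes with convex combinations of those vertices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-D triangle in V-representation: one row per vertex.
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# An arbitrary affine map into a 1000-dimensional space (illustrative values).
W = rng.standard_normal((1000, 2))
c = rng.standard_normal(1000)


def apply_affine(vertices, W, c):
    """Apply x -> W x + c to every vertex of a V-representation."""
    return vertices @ W.T + c


image = apply_affine(triangle, W, c)

# Any point of the triangle is a convex combination of its vertices; its
# image is the same combination of the mapped vertices.
weights = np.array([0.2, 0.3, 0.5])
point = weights @ triangle
print(image.shape, np.allclose(W @ point + c, weights @ image))
```

The image polytope is described by 3 × 1000 numbers, independent of how many half-spaces an H-representation of the same set in the larger space would require.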

The 2D algorithm described above can be seen as implementing the recursive 
case of a more general, n-dimensional version of the algorithm that recurses on 
each of the (n — 1)-dimensional facets. Please see the extended version of this 
paper for more details. 


5 SyReNN tool 


This section provides more details about the design and implementation of our
tool, SyReNN (Symbolic Representations of Neural Networks), which computes
$\widehat{f}_{\restriction X}$, where f is a DNN using only piecewise-linear layers and X is a
union of one- or two-dimensional polytopes. The tool is available under the MIT
license at https://github.com/95616ARG/SyReNN and in the PyPI package pysyrenn.


Input and output format. SyReNN supports reading DNNs from two standard
formats: ERAN (a textual format used by the ERAN project [1]) as well as
ONNX [42] (an industry-standard format supporting a wide variety of different
models). Internally, the input DNN is described as an instance of the Network
class, which is itself a list of sequential Layers. A number of layer types are
provided by SyReNN, including FullyConnectedLayer, ConvolutionalLayer,
and ReLULayer. To support more complicated DNN architectures, we have
implemented a ConcatLayer, which represents a concatenation of the output of
two different layers. The input region of interest, X, is defined as a polytope
described by a list of its vertices in counter-clockwise order. The output of
the tool is the symbolic representation $\widehat{f}_{\restriction X}$.


Overall Architecture. We designed SyReNN in a client-server architecture 
using gRPC [20] and protocol buffers [21] as a standard method of communica- 
tion between the two. This architecture allows the bulk of the heavy computation 
to be done in efficient C++ code, while allowing user-friendly interfaces in a va- 
riety of languages. It also allows practitioners to run the server remotely on a 
more powerful machine if necessary. The C++ server implementation uses the 
Intel TBB library for parallelization. Our official front-end library is written 
in Python, and available as a package on PyPI so installation is as simple as 
pip install pysyrenn. The entire project can be built using the Bazel build 
system, which manages dependencies using checksums. 


Server Architecture. The major algorithms are implemented as a gRPC
server written in C++. When a connection is first made, the server initializes
the state with an empty DNN f(x) = x. During the session, three operations
are permitted: (i) append a layer g so that the current session’s DNN is updated
from $f_0$ to $f(x) := g(f_0(x))$, (ii) compute $\widehat{f}_{\restriction X}$ for a one-dimensional
X, or (iii) compute $\widehat{f}_{\restriction X}$ for a two-dimensional X. We have separate
methods for one- and two-dimensional X, because the one-dimensional case has
specific optimizations for controlling memory usage. The SegmentedLine and
UPolytope types are used to represent one- and two-dimensional partitions of X,
respectively. When operation (i) is performed, a new instance of the
LayerTransformer class is initialized with the relevant parameters and added to
a running vector of the current layers. When operation (ii) is performed, a new
queue of SegmentedLines is constructed, corresponding to X, and the previously
allocated LayerTransformers are applied sequentially to compute
$\widehat{f}_{\restriction X}$. In this case, extra control is provided to automatically gauge
memory usage and pause computation for portions of X until more memory is made
available. Finally, when operation (iii) is performed, a new instance of
UPolytope is initialized with the vertices of X and the LayerTransformers are
again applied sequentially to compute $\widehat{f}_{\restriction X}$.


Client Architecture. Our Python client exposes an interface for defining
DNNs similar to the popular Sequential-Network Keras API [11]. Objects
represent individual layers in the network, and they can be combined
sequentially into a Network instance. The key addition of our library is that
this Network exposes methods for computing $\widehat{f}_{\restriction X}$ given a
V-representation description of X. To do this, it invokes the server and passes
a layer-by-layer description of f followed by the polytope X, then parses the
response $\widehat{f}_{\restriction X}$.


Extending to support different layer types. Different layer types are
supported by sub-classing the LayerTransformer class. Instances of this class
expose a method for computing $\mathrm{Extend}(h, \cdot)$ for the corresponding layer h. To
simplify implementation, two sub-classes of LayerTransformer are provided:
one for entirely-linear layers (such as fully-connected and convolutional layers),
and one for piecewise-linear layers. For fully-linear layers, all that needs to be
provided is a method computing the layer function itself. For piecewise-linear
layers, two methods need to be provided: (1) computing the layer function itself,
and (2) one describing the hyperplanes which separate the linear regions. The
base class then directly implements Algorithm 1 for that layer. This architecture
makes supporting new layers a straightforward process.
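
The division of labor between linear and piecewise-linear layers can be mirrored in a few lines of Python. The sketch below is purely illustrative: the actual transformers live in the C++ server, and apart from the class name LayerTransformer, every name here (`compute`, `hyperplanes`, `FullyConnected`, `HardTanh`) is a hypothetical stand-in. Hard Tanh is assumed to clip each coordinate to [-1, 1].

```python
from abc import ABC, abstractmethod
import numpy as np


class LayerTransformer(ABC):
    """Hypothetical Python mirror of the per-layer interface described above."""

    @abstractmethod
    def compute(self, x):
        """Apply the layer function to a single input vector."""

    def hyperplanes(self, dim):
        """Pairs (N, b) for the planes N . x = b separating linear regions."""
        return []  # entirely-linear layers have a single linear region


class FullyConnected(LayerTransformer):
    """An entirely-linear layer: only the layer function is needed."""

    def __init__(self, weights, biases):
        self.weights = np.asarray(weights, dtype=float)
        self.biases = np.asarray(biases, dtype=float)

    def compute(self, x):
        return self.weights @ x + self.biases


class HardTanh(LayerTransformer):
    """A piecewise-linear layer: layer function plus separating hyperplanes."""

    def compute(self, x):
        return np.clip(x, -1.0, 1.0)

    def hyperplanes(self, dim):
        # Each coordinate changes behavior at x_k = -1 and at x_k = +1.
        return [(axis, b) for axis in np.eye(dim) for b in (-1.0, 1.0)]


layers = [FullyConnected([[2.0, 0.0], [0.0, 0.5]], [0.0, 0.0]), HardTanh()]
out = np.array([1.0, 1.0])
for layer in layers:
    out = layer.compute(out)
print(out, len(layers[1].hyperplanes(2)))
```

A generic refinement routine only needs these two methods per layer, which is why adding a new piecewise-linear layer reduces to describing its function and its boundaries.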

Float Safety. Like Reluplex [32], SyReNN uses floating-point arithmetic to
compute $\widehat{f}_{\restriction X}$ efficiently. Unfortunately, this means that in some cases
its results will not be entirely precise when compared to a real-valued or
multiple-precision version of the algorithm. Approaches for addressing this are
discussed in the extended version of this paper.


6 Applications of SyReNN 


This section presents the use of SyReNN in three example case studies. 


6.1 Integrated Gradients 


A common problem in the field of explainable machine learning is understanding 
why a DNN made the prediction it did. For example, given an image classified 
by a DNN as a ‘cat,’ why did the DNN decide it was a cat instead of, say, a dog? 
Were there particular pixels which were particularly important in deciding this? 
Integrated Gradients (IG) [52] is the state-of-the-art method for computing such 
model attributions. 


Definition 5. Given a DNN f, the integrated gradients along dimension i for
input x and baseline x′ is defined to be:

\[
IG_i(x) := (x_i - x'_i) \times \int_{\alpha=0}^{1}
\frac{\partial f(x' + \alpha \times (x - x'))}{\partial x_i} \, d\alpha \qquad (2)
\]

The computed value $IG_i(x)$ determines relatively how important the ith input
(e.g., pixel) was to the classification.

However, exactly computing this integral requires a symbolic, closed form
for the gradient of the network. Until [50], it was not known how to compute
such a closed form and so IGs were always only approximated using a
sampling-based approach. Unfortunately, because it was unknown how to compute
the true value, there was no way for practitioners to determine how accurate
their approximations were. This is particularly concerning in fairness
applications where an accurate attribution is exceedingly important.

In [50], it was recognized that, when X = ConvexHull({x, x′}),
$\widehat{f}_{\restriction X}$ can be used to exactly compute $IG_i(x)$. This is because within
each partition of $\widehat{f}_{\restriction X}$ the gradient of the network is constant
because it behaves as a linear function, and hence the integral can be written
as the weighted sum of such finitely-many gradients.¹ Using our symbolic
representation, the exact IG can thus be computed as follows:

\[
IG_i(x) = (x_i - x'_i)
\sum_{\mathrm{ConvexHull}(\{y_j, y_k\}) \in \widehat{f}_{\restriction \mathrm{ConvexHull}(\{x, x'\})}}
\frac{\lVert y_k - y_j \rVert}{\lVert x - x' \rVert} \,
\frac{\partial f\left(\tfrac{1}{2}(y_j + y_k)\right)}{\partial x_i}
\]

where $y_j, y_k$ are the endpoints of the segment, with $y_j$ closer to x and $y_k$
closer to x′.
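
The weighted-sum form can be tried on a toy network. The sketch below is an illustration under assumptions, not the IntegratedGradientsHelper implementation: the one-hidden-layer ReLU network and its weights are made up, and the breakpoints along the line from the baseline to the input are found directly from the hidden pre-activations rather than through SyReNN. The final check uses the completeness property of integrated gradients, namely that the attributions sum to f(x) − f(x′).

```python
import numpy as np

# A made-up network f(x) = w2 . ReLU(W1 x + b1) with two inputs.
W1 = np.array([[1.0, -1.0], [1.0, 1.0], [-1.0, 2.0]])
b1 = np.array([0.0, -1.0, 0.5])
w2 = np.array([1.0, -2.0, 1.0])


def f(x):
    return float(w2 @ np.maximum(W1 @ x + b1, 0.0))


def gradient(x):
    """Gradient of f at x; constant inside each linear region."""
    active = (W1 @ x + b1 > 0).astype(float)
    return W1.T @ (w2 * active)


def exact_ig(x, baseline):
    """Exact IG as a length-weighted sum of per-segment gradients."""
    direction = x - baseline
    z0, dz = W1 @ baseline + b1, W1 @ direction
    alphas = {0.0, 1.0}
    for z, d in zip(z0, dz):
        if d != 0.0:
            a = -z / d  # where this hidden unit's pre-activation crosses 0
            if 0.0 < a < 1.0:
                alphas.add(float(a))
    alphas = sorted(alphas)
    total = np.zeros_like(x)
    for lo, hi in zip(alphas, alphas[1:]):
        midpoint = baseline + 0.5 * (lo + hi) * direction
        total += (hi - lo) * gradient(midpoint)  # weight = fraction of the line
    return direction * total


x, baseline = np.array([1.0, 1.0]), np.array([0.0, 0.0])
ig = exact_ig(x, baseline)
print(ig, ig.sum(), f(x) - f(baseline))
```

Each weight (hi − lo) equals the ratio of the segment length to the length of the whole line, which matches the weighting in the sum above.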


Implementation. The helper class IntegratedGradientsHelper is provided 
by our Python client library. It takes as input a DNN f and a set of (x, x’) 
input-baseline pairs and then computes IG for each pair. 


Empirical Results. In [50] SyReNN was used to show conclusively that ex- 
isting sampling-based methods were insufficient to adequately approximate the 
true IG. This realization led to changes in the official IG implementation to use 
the more-precise trapezoidal sampling method we argued for. 


Timing Numbers. In those experiments, we used SyReNN to compute
$\widehat{f}_{\restriction X}$ for three different DNNs f, namely the small, medium, and large
convolutional models from [1]. For each DNN, we ran SyReNN on 100
one-dimensional lines. The 100 calls to SyReNN completed in 20.8 seconds for
the small model, 183.3 seconds for the medium model, and 615.5 seconds for the
large model. Tests were performed on an Intel Core i7-7820X CPU at 3.60GHz with
32GB of memory.


6.2 Visualization of DNN Decision Boundaries 


Whereas IG helps understand why a DNN made a particular prediction about 
a single input point, another major task is visualizing the decision boundaries 
of a DNN on infinitely-many input points. Figure 2 shows a visualization of an 
ACAS Xu DNN [31] which takes as input the position of an airplane and an 
approaching attacker, then produces as output one of five advisories instructing 
the plane, such as “clear of conflict” or to move “weak left.” Every point in 
the diagram represents the relative position of the approaching plane, while the 
color indicates the advisory. 


¹ As noted in [50], this technically requires a slight strengthening of the
definition of $\widehat{f}_{\restriction X}$, which is satisfied by our algorithms as defined
above.

[Figure 2 panels: (a) Decision boundaries computed using $\widehat{f}_{\restriction X}$;
(b) Decision boundaries computed using DeepPoly[$k = 25^2$]; (c) Decision
boundaries computed using DeepPoly[$k = 100^2$]. Legend: Clear-of-Conflict,
Weak Right, Strong Right, Strong Left, Weak Left.]

Fig. 2: Visualization of decision boundaries for the ACAS Xu network. Using
SyReNN (left) quickly produces the exact decision boundaries. Using abstract
interpretation-based tools like DeepPoly (middle and right) is slower and
produces only imprecise approximations of the decision boundaries.


One approach to such visualizations is to simply sample finitely-many points 
and extrapolate the behavior on the entire domain from those finitely-many 
points. However, this approach is imprecise and risks missing vital information 
because there is no way to know the correct sampling density to use to identify 
all important features. 


Another approach is to use a tool such as DeepPoly [49] to over-approximate
the output range of the DNN. However, because DeepPoly is a relatively coarse
over-approximation, there may be regions of the input space for which it cannot
state with confidence the decision made by the network. In fact, the
approximations used by DeepPoly are extremely coarse. A naive application of
DeepPoly to this problem results in it being unable to make claims about any of
the input space of interest. In order to utilize it, we must partition the
space and run DeepPoly within each partition, which significantly slows down
the analysis. Even when using $25^2$ partitions, Figure 2b shows that most of the
interesting region is still unclassifiable with DeepPoly (shown in white). Only
with $100^2$ partitions can DeepPoly effectively approximate the decision
boundaries, although it is still quite imprecise.


By contrast, $\widehat{f}_{\restriction X}$ can be used to exactly determine the decision
boundaries on any 2D polytope subset of the input space, which can then be
plotted. This is shown in Figure 2a. Furthermore, as shown in Table 1, the
approach using $\widehat{f}_{\restriction X}$ is significantly faster than that using ERAN
with $55^2$ or more partitions, even as we get the precise answer instead of an
approximation. Such visualizations can be particularly helpful in identifying
issues to be fixed using techniques such as those in Section 6.3.

Table 1: Comparing the performance of DNN visualization using SyReNN versus
DeepPoly for the ACAS Xu network [31]. $\widehat{f}_{\restriction X}$ size is the number of
partitions in the symbolic representation. SyReNN time is the time taken to
compute $\widehat{f}_{\restriction X}$ using SyReNN. DeepPoly[k] time is the time taken to
compute DeepPoly for approximating decision boundaries with k partitions. Each
scenario represents a different two-dimensional slice of the input space;
within each slice, the heading of the intruder relative to the ownship along
with the speed of each involved plane is fixed.

                                                                    DeepPoly time (secs)
Scenario               $\widehat{f}_{\restriction X}$ size   SyReNN time (secs)   $k = 25^2$   $k = 55^2$   $k = 100^2$
Head-On, Slow          33200              10.9                 9.1          43.2         141.3
Head-On, Fast          30769              10.2                 8.2          39.0         128.0
Perpendicular, Slow    37251              12.5                 9.2          42.9         141.7
Perpendicular, Fast    33931              11.4                 8.2          39.2         127.5
Opposite, Slow         36743              12.1                 9.8          46.7         152.5
Opposite, Fast         38965              13.0                 9.5          45.2         147.3
-Perpendicular, Slow   36037              11.9                 9.5          45.0         146.4
-Perpendicular, Fast   33208              10.9                 8.3          39.5         130.2


Implementation. The helper class PlanesClassifier is provided by our 
Python client library. It takes as input a DNN f and an input region X, then 
computes the decision boundaries of f on X. 


Timing Numbers. Timing comparisons are given in Table 1. We see that
SyReNN is quite performant, and the exact symbolic representation can be
computed more quickly than even a mediocre approximation from DeepPoly using
$55^2$ partitions. Tests were performed on a dedicated Amazon EC2 c5.metal
instance, using BenchExec [5] to limit the number of CPU cores to 16 and RAM to
16GB.


6.3 Patching of DNNs 


We have now seen how SyReNN can be used to visualize the behavior of a DNN. 
This can be particularly useful for identifying buggy behavior. For example, 
in Figure 2a we can see that the decision boundary between “strong right” and 
“strong left” is not symmetrical. 

The final application we consider for SyReNN is patching DNNs to correct
undesired behavior. Patching is described formally in [51]. Given an initial
network N and a specification $\phi$ describing desired constraints on the
input/output, the goal of patching is to find a small modification to the
parameters of N producing a new DNN N′ that satisfies the constraints in $\phi$.

The key theory behind DNN patching we will use was developed in [51]. The
key realization of that work is that, for a certain DNN architecture, correcting
the network behavior on an infinite, 2D region X is exactly equivalent to
correcting its behavior on the finitely-many vertices $\mathrm{Vert}(P_i)$ for each of the
finitely-many $P_i \in \widehat{f}_{\restriction X}$. Hence, SyReNN plays a key role in enabling
efficient DNN patching.

[Figure 3 panels: (a) Before patching. (b) Patched pockets. (c) Patched bands.
(d) Patched symmetry. Legend: Clear-of-Conflict, Weak Right, Strong Right,
Strong Left, Weak Left.]

Fig. 3: Network patching.

For this case study, we patched the same aircraft collision-avoidance DNN
visualized in Section 6.2. We patched the DNN three times to correct three
different buggy behaviors of the network: (i) remove “Pockets” of strong
left/strong right in regions that are otherwise weak left/weak right;
(ii) remove the “Bands” of weak-left advisory behind and to the left of the
plane; and (iii) enforce “Symmetry” across the horizontal. The DNNs before and
after patching with different specifications are shown in Figure 3.


Implementation. The helper class NetPatcher is provided by our Python
client library. It takes as input a DNN f and pairs $(X_i, Y_i)$ of input region
and output label, then computes a new DNN f′ which maps all points in each $X_i$
into $Y_i$.


Timing Numbers. As in Section 6.2, computing $\widehat{f}_{\restriction X}$ for use in patching
took approximately 10 seconds.


7 Related Work 


The related problem of exact reach set analysis for DNNs was investigated in
[58]. However, the authors use an algorithm that relies on explicitly enumerating
all exponentially-many ($2^n$) possible signs at each ReLU layer. By contrast,
our algorithm adapts to the actual input polytopes, efficiently restricting its
consideration to activations that are actually possible.

Hanin and Rolnick [25] prove theoretical properties about the cardinality of
$\widehat{f}_{\restriction X}$ for ReLU networks, showing that $|\widehat{f}_{\restriction X}|$ is expected to
grow polynomially with the number of nodes in the network for
randomly-initialized networks.

Thrun [55] and Bastani et al. [4] extract symbolic rules meant to approximate
DNNs, which can approximate the symbolic representation $\widehat{f}_{\restriction X}$.

In particular, the ERAN [1] tool and underlying DeepPoly [49] domain were
designed to verify the non-existence of adversarial examples. Breutel et al. [6]
give an iterative refinement algorithm for an overapproximation of the weakest
precondition as a polytope where the required output is also a polytope.

Scheibler et al. [46] verify the safety of a machine-learning controller using
the SMT-solver iSAT3, but support only small unrolling depths and basic safety
properties. Zhu et al. [60] use a synthesis procedure to generate a safe
deterministic program that can enforce safety conditions by monitoring the
deployed DNN and preventing potentially unsafe actions. The presence of
adversarial and fooling inputs for DNNs as well as applications of DNNs in
safety-critical systems has led to efforts to verify and certify DNNs
[3,32,14,29,16,7,57,49,2]. Approximate reachability analysis for neural
networks safely overapproximates the set of possible outputs
[16,58,59,57,13,56].

Prior work in the area of network patching focuses on enforcing constraints 
on the network during training. DiffAI [39] is an approach to train neural net- 
works that are certifiably robust to adversarial perturbations. DL2 [15] allows 
for training and querying neural networks with logical constraints. 


8 Conclusion and Future Work 


We presented SyReNN, a tool for understanding and analyzing DNNs. Given
a piecewise-linear network and a low-dimensional polytope subspace of the
input space, SyReNN computes a symbolic representation that decomposes the
behavior of the DNN into finitely-many linear functions. We showed how to
efficiently compute this representation, and presented the design of the
corresponding tool. We illustrated the utility of SyReNN on three applications:
computing exact IG, visualizing the behavior of DNNs, and patching (repairing)
DNNs.

In contrast to prior work, SyReNN explores a unique point in the design 
space of DNN analysis tools. Instead of trading off precision of the analysis 
for efficiency, SyReNN focuses on analyzing DNN behavior on low-dimensional 
subspaces of the domain, for which we can provide both efficiency and precision. 

We plan on extending SyReNN to make use of GPUs and other massively-parallel
hardware to more quickly compute $\widehat{f}_{\restriction X}$ for large f or X. Supporting
input polytopes of more than two dimensions is also a ripe area of future work.
We may also be able to take advantage of the fact that non-convex polytopes can
be represented efficiently in 2D. Extending algorithms for
$\widehat{f}_{\restriction X}$ to handle architectures such as Recurrent Neural Networks (RNNs)
will open up new application areas for SyReNN.

Abstract. In this paper, we present MachSMT, an algorithm selection
tool for Satisfiability Modulo Theories (SMT) solvers. MachSMT sup- 
ports the entirety of the SMT-LIB language. It employs machine learn- 
ing (ML) methods to construct both empirical hardness models (EHMs) 
and pairwise ranking comparators (PWCs) over state-of-the-art SMT 
solvers. Given an SMT formula Z as input, MachSMT leverages these 
learnt models to output a ranking of solvers based on predicted run 
time on the formula Z. We evaluate MachSMT on the solvers, benchmarks,
and data obtained from SMT-COMP 2019 and 2020. We observe that
MachSMT frequently improves on competition winners, winning 54 divisions
outright with up to a 198.4% improvement in PAR-2 score, notably
in logics that have broad applications (e.g., BV, LIA, NRA) in verification,
program analysis, and software engineering. The MachSMT tool
is designed to be easily tuned and extended to any suitable solver appli- 
cation by users. MachSMT is not a replacement for SMT solvers by any 
means. Instead, it is a tool that enables users to leverage the collective 
strength of the diverse set of algorithms implemented as part of these 
sophisticated solvers. 


Keywords: SMT Solvers · Machine Learning · Algorithm Selection


1 Introduction 


Satisfiability Modulo Theories (SMT) solvers are tools to decide the satisfiability 
of formulas over first-order theories such as bit-vectors, floating-point arithmetic, 
integers, reals, strings, arrays, and their combinations [44,9,24,18,47,20,46]. In 
recent years, SMT solvers have had a revolutionary impact on applications in 
software engineering (broadly construed), such as software testing [17,48] and 
verification [23,15,27,39], as well as in sub-fields of AI [53,35,30]. This impact is a 
driver for an insatiable demand for ever more efficient solvers, not only to scale to
larger instances obtained from existing applications (e.g., automatic bug-finding
in commercial software [26,4]), but also to solve problems from new application
domains (e.g., verification and synthesis of cryptographic primitives [13]). 


Motivation for Algorithm Selection for SMT Solvers. In response 
to this high demand, the SMT community has developed a plethora of solver 
heuristics and configurations. For example, in the 2019 edition of the annual 
SMT-COMP competition [10,31], more than 50 solvers and their configurations 
were submitted. Many of these solvers implement very different algorithms to 
tackle the satisfiability problem for (a combination of) first-order theories, with 
significantly varying performance profiles. For example, in the quantifier-free 
theory of floating-point arithmetic (QF_FP), there exist several substantially 
different decision procedures, e.g., bit-blasting [16], abstract CDCL [14], inter- 
reduction methods [55], and reduction to global optimization [22,11]. In this 
specific setting of floating-point solvers, input instances may be derived from 
a variety of applications, such as software verification or analysis of machine 
learning (ML) models [56]. In such a scenario, a very natural question arises: 
which solver or configuration is best for a given input instance? 

Another well-known issue with many SMT solvers (even state-of-the-art ones) 
is that users may not know a priori which formula features or encoding would 
make an instance easy to solve. This can be very frustrating for users as they 
have to try a large number of different encodings with different solver configura- 
tions before they can figure out which combination works best for their specific 
scenario, which may result in a combinatorial explosion. Users have also noted 
that as their applications change, what was once a great solver configuration 
in an earlier setting is suddenly not very good in the newer one. One possible 
approach to address this problem is to use a portfolio of solvers, just as has 
been successfully done in the context of SAT solvers. Unfortunately, given the 
plethora of solvers (more than 50 in SMT-COMP 2019 and 2020) and configurations
(CVC4 [9] alone utilizes 23 different configurations in a sequential portfolio
setting for quantified logics), such an approach quickly becomes infeasible in the
SMT solver setting. 


Brief Overview of MachSMT. One way to address the above-mentioned 
problems is to use an automated algorithm-selection tool that can automati- 
cally and with high accuracy predict the best algorithm from a given set of 
algorithms for a specific input. Such a tool selects the best SMT solver from a 
set of solvers for a given SMT formula. To this end, we introduce MachSMT, 
a machine learning-based algorithm-selection tool. MachSMT supports the en- 
tirety of the SMT-LIB language [8]. It takes as input an instance for a specified 
theory of interest, and outputs a ranking of solvers predicted to have the lowest 
runtime. Internally, MachSMT is a set of machine learnt models constructed by 
analyzing the runtimes of solver configurations on benchmarks with respect to 
the frequencies of grammatical constructs (e.g., predicates, functions, rounding 
modes, etc.). Additionally, it defines other syntactical properties that can
influence performance (e.g., quantifier nesting levels).

At a high level, MachSMT works as follows. At its core, MachSMT uses two
techniques to perform algorithm selection: empirical hardness models (EHMs)
and pairwise ranking comparators (PWCs). As features, MachSMT uses frequencies
of grammatical constructs from the SMT-LIB language [8], in addition to several
other syntactical metrics. These features are pipelined through Principal Component
Analysis (PCA) and AdaBoost to construct its empirical hardness models and
comparators.

An EHM for a given solver S is a mapping from an input instance Z to a
predicted runtime of S on Z. At runtime, given Z, MachSMT queries the EHMs of
all solvers (that were considered during training) over Z, and outputs a ranking of
solvers based on their predicted runtimes (the top-ranked solver is predicted to solve
the input problem the fastest). By contrast, a learnt pairwise ranking comparator
(PWC) is a mapping that takes as input a pair (S1, S2) of solvers and an input
instance Z, and outputs a ranking over the input solvers based on which one of
them is predicted to have a lower runtime on Z (denoted as S1 < S2 or S1 > S2).
During evaluation, given an input instance Z, MachSMT uses the learnt PWC
as a comparator to rank the set of solvers.

While algorithm selection has been considered in the broad setting of solvers 
(e.g., QBF solvers [50] and SAT solvers [67]) as well as certain specific SMT 
theories [57,5,64], we are not aware of previous work on algorithm selection aimed 
at the entirety of SMT-LIB [7]. Our results demonstrate that the MachSMT 
algorithm selector is highly effective, in that it outperforms the competition 
winners on the majority of tracks from the SMT-COMP in 2019 and 2020. 

Perhaps the first algorithm selection tool in the context of logic solvers was 
SATZilla [67]. Since its introduction, SATZilla has had a tremendous impact 
on SAT solver research, winning multiple gold medals in the SAT competitions. 
Having said that, there are several significant differences between MachSMT and 
SATZilla. Briefly, SATZilla deploys a feature selection scheme to avoid the curse
of dimensionality, while MachSMT leverages a learnt dimensionality reduction
scheme, namely, Principal Component Analysis (PCA). In fact, a feature selection
scheme would simply not scale in the context of SMT solvers given the very
large number of learnt models that are incorporated into MachSMT. We discuss
additional differences between SATZilla and MachSMT at length in Section 6.

It goes without saying that MachSMT is only as powerful as the underlying 
solvers that it has access to. MachSMT is clearly not a replacement for any par- 
ticular SMT solver, but rather a tool that enables users to leverage the collective 
strength of the diverse set of algorithms and configurations implemented as part 
of these sophisticated solvers. 


Contributions. 
We make the following contributions in this paper. 


1. The MachSMT Algorithm Selection Tool. We present the MachSMT 
tool, an algorithm selection tool for the entirety of SMT-LIB. MachSMT 
uses machine learning (ML) to construct EHMs and PWCs of solvers for 
algorithm selection. A key feature of the MachSMT tool is that it is designed to
be easily tuned and extended by SMT solver users (Section 3).

2. Analysis of MachSMT over SMT-COMP 2019 and 2020 Benchmarks
and Solvers. We perform an extensive experimental analysis of
MachSMT across all divisions from SMT-COMP 2019 and 2020. We observe
that MachSMT improves on competition winners in 54 divisions, with up to
198.4% improvement in performance for the QF_BVFPLRA SQ ’20 and up
to 191.1% for the QF_BVFP SQ ’20 division. We provide our learnt models,
used in our experimentation, for ease of use and transparency. While
building learnt models for MachSMT can be computationally expensive
(a one time cost), installing, downloading, and using our models is easy
(Section 4). All source code and learnt models from our experiments can be
found at: https://github.com/j29scott/MachSMT. The artifact is available
at: https://zenodo.org/record/4458699.


The rest of this paper is structured as follows. Section 2 provides the neces- 
sary background, Section 3 gives a technical description of MachSMT, Section 4 
gives an experimental evaluation of MachSMT over SMT-COMP 2019 and 2020, 
Section 5 provides an analysis of the experimental results, Section 6 describes 
related work, and Section 7 concludes the paper and discusses future work. 


2 Background 


In this section, we provide some background on algorithm selection via EHMs 
and PWCs, and the machine learning methods we use, such as principal compo- 
nent analysis (PCA) and k-fold cross validation. 


2.1 A Brief Overview of Algorithm Selection 


The idea of algorithm selection was first proposed and formalized by Rice [51]
in 1976. Researchers have long known that given a set of different algorithms
and implementations for the same specification or problem, it is often the
case that one of these implementations may perform poorly on a given class of 
inputs while another might perform very well. This is especially true for prob- 
lems believed to be computationally hard (e.g., NP-hard). The reasons for this 
phenomenon could be as diverse as choice of data structures, fundamental differ- 
ences between algorithms, or the fact that heuristics implemented as part of one 
algorithm can exploit the input problem structure or the underlying hardware 
better than the others. 

It is natural to want to exploit the diversity in algorithmic approaches to
minimize the cumulative runtimes. In practice, however, users often deploy greedy
algorithm selection, that is, picking the best observed algorithm based on empirical
analysis and testing. Greedy algorithm selection can be sub-optimal
when the best empirical algorithm has deficiencies relative to other algorithms
on certain families of inputs.

With the recent advances in AI and ML, researchers are beginning to leverage
these new technologies to advance algorithm selection. To the best of our
knowledge, there are two key approaches for ML-driven algorithm selection in
the context of constraint solvers: through the use of Empirical Hardness Models 
(EHMs), and through Pairwise Ranking Comparators (PWCs). 


Algorithm Selection via Empirical Hardness Models (EHMs). Let $Z$ be
an input in the language of $S$ with a corresponding feature vector $\vec{x} \in \mathbb{R}^n$. For
an algorithm $s \in S$, an EHM is a learnt function $f_s : \mathbb{R}^n \to \mathbb{R}$ that predicts the
runtime of $s$ on $Z$. An EHM is constructed with an ML regression model trained
on collected runtime data. The algorithm is then selected by computing:


$$\operatorname*{argmin}_{s \in S} f_s(Z)$$


Algorithm Selection via Pairwise Ranking Comparators (PWCs). Let
$P$ be the set of all unordered pairs of algorithms in $S$ (sets of size two). For each
$p = (S_i, S_j) \in P$, construct a learnt comparator $f_p : \mathbb{R}^n \to \{0, 1\}$ that returns 0
if algorithm $S_i$ solves $Z$ faster than $S_j$, and 1 otherwise. For an input $Z$ with a
feature vector $\vec{x}$, we compute a ranking of algorithms as a map $r$ over $S$, where for
$s \in S$, $r[s]$ is the ranking of solver $s$ that represents "how many solvers in $S$ are
faster than $s$ in solving the input $Z$", or more formally:
$r[s] = \sum_{p \in P : s \in p} f_p(\vec{x})$. The selected solver is then the minimum
ranked solver, i.e.,


$$\operatorname*{argmin}_{s \in S} r[s]$$
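
As an illustration only, the two selection rules above can be sketched in a few lines of Python. The toy "models" below are hand-written functions standing in for learnt regressors and classifiers, and the names select_by_ehm and rank_by_pwc are hypothetical; they are not part of MachSMT. The PWC sketch implements the informal reading of r[s] directly, counting for each solver how many others are predicted to be faster.

```python
from itertools import combinations


def select_by_ehm(ehms, x):
    """Return the solver whose EHM predicts the lowest runtime on x."""
    return min(ehms, key=lambda s: ehms[s](x))


def rank_by_pwc(solvers, pwcs, x):
    """Count, for each solver, how many solvers are predicted faster on x.

    pwcs maps an ordered pair (a, b) to a comparator that returns 0 if a
    is predicted to be faster than b, and 1 otherwise.
    """
    losses = {s: 0 for s in solvers}
    for a, b in combinations(solvers, 2):
        if pwcs[(a, b)](x) == 0:
            losses[b] += 1  # a is predicted faster, so b loses this pair
        else:
            losses[a] += 1
    return losses


# Toy stand-ins for learnt models (assumptions for illustration only).
ehms = {"A": lambda x: 2.0 + x[0], "B": lambda x: 5.0 - x[0], "C": lambda x: 4.0}
solvers = sorted(ehms)
# Here the comparators are derived from the toy EHMs; in practice each
# comparator would be a separately trained classifier.
pwcs = {
    (a, b): (lambda x, a=a, b=b: 0 if ehms[a](x) < ehms[b](x) else 1)
    for a, b in combinations(solvers, 2)
}

x = [0.5]
best_ehm = select_by_ehm(ehms, x)
losses = rank_by_pwc(solvers, pwcs, x)
best_pwc = min(solvers, key=lambda s: losses[s])
```

On this toy input both rules pick the same solver, but in general they need not agree, since the comparators are trained independently of the runtime regressors.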


2.2 Supervised Learning, Adaptive Boosting, Curse of 
Dimensionality, and K-Fold Cross-Validation


Supervised learning is one of the most predominant areas of ML. Supervised
learning takes as input a dataset of features $X$ and labels $Y$, where each datapoint
$\vec{x} \in X$ has a label $y \in Y$. A datapoint is a real-valued vector $\vec{x} \in \mathbb{R}^n$ describing a
sample. The learning problem is said to be a classification problem if the labels
$y \in Y$ come from a fixed and finite set of classes $C$ (e.g., a set of algorithms).
Alternatively, the learning problem is a regression problem if the labels are
real-valued (e.g., runtimes).

One efficient and effective approach to supervised learning is Adaptive Boosting
(AdaBoost). AdaBoost is an ensemble approach to machine learning invented
by Freund and Schapire [21], for which they won the Gödel Prize in 2003. In
ensemble learning, a set of learning algorithms (e.g., weak learners) is trained, and
predictions are made democratically across the set. In this paper, we exclusively
consider AdaBoost to solve both the classification and regression problems for
algorithm selection. We use an ensemble of 200 decision trees in the AdaBoost
algorithm. For more, we refer to Drucker et al. [19].

While supervised learning has had tremendous impacts in several areas of 
research, there are pitfalls, such as the curse of dimensionality (CoD). Consider 
the convex polytope P formed around the convex hull of X. The volume of P
increases exponentially with the dimensionality of X, requiring an exponential
number of datapoints to avoid extreme sparsity in X. Sparsity in datasets is
one of the leading causes of poor performance in learnt models [28]. There
is a large literature on managing the CoD. In this paper, we discuss feature 
selection and deploy dimensionality reduction solutions. In feature selection, a 
new dataset X’ is computed from X by selecting the subset of features that are 
the most performant on a validation dataset. Feature selection was deployed in 
the successful SATZilla algorithm selection tool for Boolean satisfiability. 

Despite the success of feature selection in SATZilla, feature selection does
have some flaws. First, there is a significant loss of information. In the case of
SATZilla, a feature vector composed of more than a hundred values describing
an input is reduced to just five values. Second, the total number of feature 
subsets is exponential in the number of features. While there has been a great 
deal of research in reducing the time spent searching for high performing subsets 
[65,36], in our experiments, we found it to be the most computationally taxing 
component of the SATZilla framework. 

When evaluating the performance of a supervised learning model, a training 
set is used to construct the learnt model and a testing set is set aside to evaluate. 
However, this method alone can be prone to overfitting and selection bias [54,43]. 
Instead, researchers often use k-fold cross-validation to evaluate their learnt
models. In k-fold cross-validation, the dataset is split into k sets, and the learnt
model is trained on k - 1 sets and is evaluated on the set that is left out. This
process is repeated k times so each set gets evaluated.


2.3 Unsupervised Learning and Principal Component Analysis 


Unsupervised learning, in contrast to supervised learning, is the study of detect- 
ing patterns in an unlabelled dataset X. Applications of unsupervised learning 
include dimensionality reduction [66,63], clustering [29,72], and anomaly detec- 
tion [38,1]. Principal Component Analysis (PCA) is an unsupervised learning 
dimensionality reduction technique. PCA computes an orthogonal transformation
of a dataset $X$ composed of points in $\mathbb{R}^n$ to a new dataset $X'$ composed
of points in $\mathbb{R}^{n'}$, where $n' < n$. PCA is an incremental algorithm in which each
iteration computes a new component (or dimension). On the first iteration,
a hyperplane is fit around the dataset $X$, and its corresponding spanning vector
is the first element of the basis of the transformation onto $X'$. On each
subsequent iteration, a new hyperplane is computed under the additional constraint
that it is orthogonal to its predecessors. This process is repeated until
the desired number of iterations is reached [32,66].


3 An Overview of MachSMT


In this section, we provide an overview of the MachSMT tool. The architecture 
diagram of MachSMT is presented in Figure 1.


Feature ID  Description

1-4         Frequency of problem description grammatical constructs
            (e.g., assert, check-sat, etc.)
[...]       Frequency of declaration/definition grammatical constructs
            (e.g., declare-const, define-fun, declare-sort, etc.)
[...]       Frequency of the echo/exit grammatical constructs
[...]       Frequency of the get-* grammatical constructs
            (e.g., get-model, get-unsat-core, etc.)
[...]       Frequency of the push/pop incremental benchmark
            grammatical constructs
[...]       Frequency of the reset/reset-assertions grammatical constructs
[...]       Frequency of the set-* grammatical constructs (e.g., set-logic)
[...]       Frequency of the forall/exists quantifiers
[...]       Frequency of let bindings
[...]       Frequency of core/Boolean constructs, sorts, and literals
            (e.g., true, Bool, and, =>, ite, distinct, etc.)
[...]       Frequency of grammatical constructs of the theory of arrays
            (e.g., select, store, etc.)
[...]       Frequency of grammatical constructs of the theory of bit-vectors
            (e.g., BitVec, bvor, bvuge, bvsge, bvult, etc.)
89-135      Frequency of grammatical constructs of the theory of floating-point
            (e.g., fp.add, Float32, RNE, fp.eq, fp.isNaN, fp.to_real, etc.)
135-150     Frequency of grammatical constructs of the theory of integers and
            reals (e.g., Int, Real, *, +, to_real, is_int)
151         Average number of selects per array
152         Average store chain depth per array
153-155     Average/Median/Deviation of BV adder chains
156-158     Number of forall/exists variables and their ratio
159         Average quantifier nesting level
160-161     Average arity and applications of uninterpreted functions
162         Size of the smt2 file in bytes


Table 1: Complete list of the 162 features used in MachSMT


Fig. 1 Architecture of MachSMT.


3.1 Features, Preprocessing, and Learning 


MachSMT uses a feature vector with 162 entries (i.e., dimensions). A complete 
description of each feature is provided in Table 1. We deploy two strategies to 
mitigate taxing feature calculation times, which can severely impair algorithm 
selection solutions. First, all features are entirely syntactical properties of the 
input. This is a major difference between MachSMT and other algorithm selection
solutions, such as SATZilla. Second, all features are calculated within a
strict and user-adjustable timeout (default 10s). On a timeout, the feature value
is recorded as -1.0.
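
To make the notion of a purely syntactic feature concrete, the following Python sketch counts how often a few SMT-LIB constructs occur in a script. It is a simplified illustration under stated assumptions, not MachSMT's extractor: the TRACKED tuple is a hypothetical handful of constructs rather than the full list in Table 1, and the tokenizer ignores comments, string literals, and quoted symbols.

```python
import re
from collections import Counter

# A few SMT-LIB constructs to count (hypothetical subset for illustration).
TRACKED = (
    "assert", "check-sat", "declare-const", "declare-fun",
    "forall", "exists", "let", "bvadd",
)

# Parentheses are tokens of their own; everything else splits on whitespace.
TOKEN = re.compile(r"[()]|[^\s()]+")


def construct_frequencies(smt2_text, tracked=TRACKED):
    """Return one frequency per tracked construct, in the order of `tracked`."""
    counts = Counter(tok for tok in TOKEN.findall(smt2_text) if tok in tracked)
    return [float(counts[name]) for name in tracked]


script = """
(set-logic QF_BV)
(declare-const x (_ BitVec 8))
(declare-const y (_ BitVec 8))
(assert (= (bvadd x y) #x10))
(check-sat)
"""
features = construct_frequencies(script)
```

Because such features require only a single pass over the input text, their cost grows with file size rather than with the difficulty of the formula.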

MachSMT performs three key preprocessing steps before constructing any
learnt models over a given dataset. We describe each in turn. First, all
feature values are scaled to zero mean and unit variance³. This data normalization
technique is common in ML research and applications to improve both
model efficiency and numerical robustness. The second step in the preprocessing
pipeline is computing the polynomial interaction terms of degree two on the
resultant normalized feature vector. These polynomial features make interactions
between features explicit. These first two preprocessing steps are included
in the SATZilla preprocessing pipeline [71].

As discussed in Section 2, ML in a high dimensional space is prone to the
curse of dimensionality. While other algorithm selection solutions (e.g., SATZilla)
commonly implement feature selection solutions, we propose the use of learnt
dimensionality reduction, namely PCA. As discussed above, feature selection can be a
proactive solution to the curse of dimensionality but presents many challenges
when applied to SMT. Internally, MachSMT manages more than a thousand
learnt models, and calculating optimal feature subsets for each one is infeasible.


3 $x' = \frac{x - \mu}{\sigma}$, where $x$ is a feature sample, $\mu$ is the mean across the specific feature on the
training set, and $\sigma$ is the standard deviation across the specific feature on the training set.


The third and final preprocessing step is applying PCA on the resultant
polynomial features. The final feature vector is composed of the first 35 principal
components. The resultant feature set is used when constructing the learnt models
with AdaBoost. We use AdaBoost for both regression in the EHMs and classification in the
PWCs. We configure AdaBoost with 200 decision tree estimators and linear 
loss. MachSMT uses scikit-learn and numpy as its ML backend and the entire 
tool is written in Python [49]. MachSMT is easily extensible and supports any 
ML model/pipeline under scikit-learn syntax. 
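
The preprocessing and learning steps described above map naturally onto a scikit-learn pipeline. The following is a minimal sketch under stated assumptions, not MachSMT's actual code: the helper name make_ehm_pipeline is hypothetical, the data is synthetic (120 samples with 10 raw features instead of 162), and a full degree-two polynomial expansion is assumed where the text says "polynomial interaction terms of degree two".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def make_ehm_pipeline(n_components=35, n_estimators=200, seed=0):
    """Scale -> degree-2 polynomial terms -> PCA -> AdaBoost regression."""
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2, include_bias=False),
        PCA(n_components=n_components),
        AdaBoostRegressor(n_estimators=n_estimators, loss="linear", random_state=seed),
    )


# Synthetic stand-in data: 120 "benchmarks" with 10 raw features each.
rng = np.random.default_rng(0)
X = rng.random((120, 10))
y = 5.0 * X[:, 0] + X[:, 1] * X[:, 2]  # made-up "runtimes" in seconds

ehm = make_ehm_pipeline()
ehm.fit(X, y)
predicted = ehm.predict(X[:3])
```

For a PWC, the same pipeline shape would apply with AdaBoostClassifier as the final step and 0/1 labels in place of runtimes.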


3.2 Variants of MachSMT 
MachSMT implements the following algorithm selection solutions. 


1. MachSMT-SolverEHM - This variant of MachSMT is analogous to the
algorithm selection approach taken by SATZilla. As described in Section 2,
an EHM is constructed for each solver, and the selected solver is computed
by taking an argmin over all predictions.

2. MachSMT-SolverLogicEHM - This approach is similar to MachSMT-SolverEHM,
with the key difference being that an EHM is constructed for every
(solver, logic) pair. As state-of-the-art SMT solvers implement significantly
different algorithms depending on the logic of the input problem, datapoints
from different logics could negatively skew predictions.

3. MachSMT-SolverPWC - This variant of MachSMT deploys the pairwise
comparator approach as described in Section 2. In this variant of the PWC,
comparators are trained for every pair of solvers across all provided data.

4. MachSMT-SolverLogicPWC - This variant of MachSMT is analogous to
MachSMT-SolverPWC, with the key difference that solver-wise comparators
are constructed by only training on the benchmarks of a common logic.


MachSMT by default creates models for all aforementioned approaches to 
algorithm selection. In evaluation, MachSMT evaluates each approach’s perfor- 
mance on each logic. In deployment, MachSMT uses the approach that had the 
best-observed performance in evaluation. 


3.3 Using MachSMT 


MachSMT consists of three core tools, which are used to build, evaluate, and 
deploy MachSMT, respectively. 


1. machsmt_build — This tool is the interface for building MachSMT’s database 
around the solvers and benchmarks provided by the user. It takes as input 
a CSV data file with the columns ‘solver’, ‘benchmark’, and ‘score’. The
output is a library directory containing the resultant database, and learnt 
models under default settings. 


machsmt_build -f data.csv -l /path/to/lib/dir


                                 Improvement over        Distance from
Logic, Track, Year   Winner      Random [%]  Winner [%]  VBS [%]
QF_BVFP, SQ’20       Bitwuzla    195.1       191.1       86.2
QF_BVFPLRA, SQ’20    MathSAT5    199.1       198.4       34.0
QF_UFBV, SQ’19       Yices       153.5       113.3       95.3
NRA, SQ’19           Vampire     169.6       114.0       99.2
QF_NRA, SQ’19        Yices       101.3       71.5        52.1
QF_UFNRA, SQ’19      Yices       148.1       77.1        36.1
QF_LIA, SQ’20        MathSAT5    132.6       71.5        46.4
QF_UFBV, SQ’20       Yices       137.8       67.4        109.4
QF_UFNRA, SQ’20      Yices       151.3       47.9        42.6
QF_ABV, INC’20       Yices       169.4       50.8        114.6
QF_NRA, SQ’20        Yices       82.5        41.2        46.5
QF_AUFLIA, SQ’20     Yices       200.0       37.2        27.9
BV, SQ’20            CVC4        112.1       30.6        117.8
QF_LRA, SQ’19        SPASS-SATT  89.3        28.4        59.5
QF_UFLRA, INC’20     Z3          133.3       26.2        19.9
QF_ANIA, SQ’20       MathSAT5    199.0       26.1        61.6
QF_LIA, SQ’19        SPASS-SATT  161.5       29.8        66.3
BV, SQ’19            Q3B         91.8        25.0        83.7
LIA, SQ’20           CVC4        172.5       22.3        19.6
QF_UFNIA, SQ’20      CVC4        125.6       21.9        105.0
UFDTNIRA, SQ’20      Vampire     123.9       24.0        92.6
QF_UFLRA, INC’19     Z3          110.0       19.6        22.0
QF_FP, SQ’19         COLIBRI     41.6        18.4        62.8
QF_AUFBV, SQ’20      Yices       82.0        20.4        3.6


Table 2: Selected results of MachSMT on data from SMT-COMP 2019 and 
2020. All numbers are percent differences of PAR-2 scores across all benchmarks. 
Columns 3 and 4 show the improvement over random selection and competition 
winners (higher is better). Column 5 shows the PAR-2 difference to the VBS 
(lower is better). 


2. machsmt_eval — This tool takes as input the library directory generated by 
machsmt_build and evaluates it under k-fold cross validation and provides 
a summary of results. It further tunes MachSMT to use the best empirically 
observed variant based on the logic and track of the input benchmark. 


machsmt_eval -l /path/to/lib/dir


3. machsmt — This tool is the primary interface to MachSMT’s algorithm selection.
Provided an input benchmark and its library files, it will output a
ranking of solvers that are predicted to solve the benchmark the fastest.


machsmt benchmark.smt2 -l /path/to/lib/dir


Fig. 2 Plot for BV in the Single Query (SQ) Track in SMT-COMP ’19.


3.4 User-defined Features 


We include a simple interface for users to extend the considered features in 
MachSMT’s algorithm selection. All that is required is to create a Python 
method that returns a single floating-point number (or an iterable object thereof) 
representing the feature. As input, the method takes the path of the SMT-LIB
input, as well as its logic and track. If a user feature is to be considered by
MachSMT, the user-defined procedure should return its floating-point representation;
otherwise, it returns None. All user-defined features are automatically
included in building MachSMT. These custom features can, in principle,
significantly affect the accuracy of MachSMT when engineered to target a specific
class of benchmarks.
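
A user-defined feature of the kind described above might look as follows. This is a hedged sketch: the function name assertion_density, its exact signature, and the choice of feature are hypothetical, and how such a function is registered with MachSMT is not shown here. It follows the stated contract only: it receives the benchmark path, logic, and track, and returns a float, or None when the feature should not be considered.

```python
def assertion_density(benchmark_path, logic, track):
    """Hypothetical user feature: number of assertions per kilobyte of input.

    Returns None for logics this feature is not meant to cover
    (here, arbitrarily, anything that is not quantifier-free).
    """
    if not logic.startswith("QF_"):
        return None
    with open(benchmark_path, "r", encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    # Guard against division by zero on an empty file.
    kilobytes = max(len(text.encode("utf-8")) / 1024.0, 1e-9)
    return text.count("(assert") / kilobytes
```

Keeping such a feature syntactic and cheap to compute is in line with the design goal of avoiding taxing feature calculation times.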


4 Experimental Evaluation of MachSMT on SMT-COMP 
2019 and 2020 Data 


In this section, we present the evaluation of our MachSMT tool (refer to Ta- 
ble 2 and CDF plots in Figures 2-6), specifically with the benchmarks, solvers, 
and solver runtime analysis from SMT-COMP 2019 and 2020. The artifact is 
available at: https://zenodo.org/record/4458699.


Fig. 3 Plot for NRA in the Single Query (SQ) Track in SMT-COMP ’19.


4.1 Experimental Setup and Methodology 


In this experiment, we used the benchmarks, timing analysis, and solvers pro- 
vided by the organizers of the SMT-COMP 2019 and 2020 competitions [31,6]. In 
both years, all solver input queries were performed on the StarExec computing 
service [58], which consists of a cluster of 2.4 GHz Intel Xeon machines running 
Red Hat Enterprise Linux 7.2. Each solver/benchmark pair was configured to 
have 4 cores and 60GB of memory available. The time limit for each pair was 
2400 seconds in 2019, and 1200 seconds in 2020. 

We evaluate MachSMT and all of its variants using k-fold cross validation 
(with k = 5). In cross validation, the dataset is randomly partitioned into k
subsets per division. A model is then trained over k - 1 subsets and makes predictions
over the subset that is excluded from training. This process is repeated
to obtain fair predictions for each subset. Cross validation is commonly deployed 
to analyze machine learning models. For more details, please see Section 2. 
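
The cross-validation protocol above can be sketched with the Python standard library alone. The helper k_fold_splits below is a hypothetical illustration, not the code used in our evaluation; it shuffles the items with a fixed seed and deals them into k folds so that every item is predicted exactly once by a model that did not see it during training.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists; each item lands in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items round-robin into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, folds[i]


benchmarks = list(range(23))  # stand-ins for the benchmarks of one division
splits = list(k_fold_splits(benchmarks, k=5))
```

Each (train, test) pair would then be used to fit a fresh model on train and record its predictions on test.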


4.2 Experimental Results 


For every division, we evaluated MachSMT by checking whether we beat the
competition winner from that division. For the sequential tracks, we evaluate
solvers according to PAR-2 scores (i.e., the wallclock runtime on successful
termination, otherwise twice the wallclock timeout)⁴ [42]. For incremental
tracks, we use the following formula:


$$w + \frac{2t}{n}\,(n - m)$$


where w is the wall clock runtime, t is the wallclock timeout, n is the total
number of check-sats in the benchmark, and m is the total number of check-sats
successfully solved.


Fig. 4 Division QF_BVFPLRA in the Single Query Track in SMT-COMP 2020.
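
The two scoring rules above can be written down directly. The sketch below is illustrative only; the function names par2_score and incremental_score are hypothetical, and the penalty for an incorrect answer (10 times the timeout) follows the footnote to this section.

```python
def par2_score(runtime, timeout, solved, correct=True):
    """PAR-2 for a single-query benchmark.

    Runtime if solved, twice the timeout otherwise; an incorrect answer
    is scored as 10 times the timeout.
    """
    if not correct:
        return 10.0 * timeout
    return runtime if solved else 2.0 * timeout


def incremental_score(wallclock, timeout, n_checks, n_solved):
    """w + (2t / n) * (n - m) for an incremental benchmark."""
    return wallclock + (2.0 * timeout / n_checks) * (n_checks - n_solved)


solved_score = par2_score(13.5, 1200, solved=True)
timeout_score = par2_score(1200.0, 1200, solved=False)
partial_score = incremental_score(300.0, 1200, n_checks=10, n_solved=8)
```

In the incremental formula, each unsolved check-sat contributes an equal share 2t/n of the PAR-2 timeout penalty, so a benchmark with all check-sats solved is scored by its wall clock time alone.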


We present select results in Table 2. We consider three baselines when evalu- 
ating MachSMT, namely: random algorithm selection, the competition winner, 
and the virtual best solver (VBS) (note, VBS is perfect algorithm selection and 
cannot be beaten). We consider all divisions of at least 25 benchmarks and ob- 
serve MachSMT to improve on the competition winner in 54 out of 85. We report 
the results for MachSMT-SolverLogicEHM in the table as it is by far the most 
performant, dominating in all divisions except for 4. 


We present select CDF plots in Figures 2-6. A CDF plot is a visualization 
of how a solver performs on a database of inputs. A point (X,Y) denotes that a 
solver S solves Y inputs within X seconds each. 
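
The points of such a CDF plot can be computed from a list of per-benchmark runtimes as in the following sketch. The helper cdf_points is hypothetical, and the conventions that an unsolved benchmark is recorded as None and that a run finishing exactly at the timeout counts as solved are assumptions made for this illustration.

```python
def cdf_points(runtimes, timeout):
    """Return (seconds, solved_count) pairs for runs that finished in time."""
    solved = sorted(t for t in runtimes if t is not None and t <= timeout)
    return [(t, count) for count, t in enumerate(solved, start=1)]


points = cdf_points([3.0, None, 0.5, 40.0, 1500.0], timeout=1200)
```

A curve that rises faster and ends higher corresponds to a solver (or selector) that solves more inputs in less time.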


4 In case of an incorrect answer, the score is recorded as 10 times the wallclock timeout.


Fig. 5 Division QF_LIA in the Single Query Track in SMT-COMP 2020.


5 Analysis and Discussion of Results 


In Section 3.2, we describe four formulations of MachSMT. In our evaluation 
(see Table 2), we observe MachSMT-SolverLogicEHM to be significantly more 
performant than all other formulations. When evaluating over SMT-COMP, in 
all divisions that MachSMT improved over the competition winner, MachSMT- 
SolverLogicEHM was the most performant in all except for three (which were 
won by MachSMT-SolverLogicPWC). 

Our experimental results validate the idea that algorithm selection (in par- 
ticular through the use of EHMs) can be a powerful way to address the com- 
binatorial explosion that solver users face when trying to decide which solver- 
configuration pair is best suited for their application. We note that MachSMT is 
particularly powerful in the context of logics, such as QF_UFBV, that are derived
from a diverse set of applications and for which a wide variety of algorithms have
been designed. As has been noted in previous work, algorithm selection
methods work well for non-homogeneous benchmarks, especially where there is 
no single algorithm (solver) that performs the best across the board. EHMs are 
an effective way to distinguish between such algorithms given a problem instance 
and predict which one might perform the best on said instance. 

One major threat to the validity of any ML solution is the generalizability of
the learnt models on unseen data. It has been noted in previous work that a practical
way to address this issue is to use a k-fold cross-validation scheme [54,43],
thus motivating our use of this approach in our experiments. We further note
that our evaluation of MachSMT includes decades of runtime analysis and more
than 100 GB of benchmarks spanning numerous applications, giving us greater
confidence in the robustness of our results.


Fig. 6 Division QF_UFBV in the Single Query Track in SMT-COMP 2020.


6 Related Work 


In this section we provide an overview of previous work on algorithm selection 
in the context of constraint solvers and contrast it with MachSMT. 


6.1 Key differences between SATZilla and MachSMT


As mentioned above, SATZilla was the first algorithm selection method in the
context of logic solvers [67]. While our work is inspired by SATZilla, MachSMT 
differs from SATZilla in several key ways. First, SATZilla deploys a feature 
selection scheme to avoid the curse of dimensionality. While good in practice for 
the SAT setting, feature selection does lose significant amounts of information. 
Further, it can be very expensive to compute optimal feature subsets. 

By contrast, MachSMT leverages a learnt dimensionality reduction scheme,
namely, Principal Component Analysis (PCA). The key advantage of PCA is
that it does not perform a search for an optimal feature subset (as one has to
do in the context of feature selection), and hence is significantly more
efficient. In fact, a feature selection method is unlikely to scale for SMT
solvers, unlike SAT, simply because of the significantly larger number of
features, logics, and solvers that one has to contend with. Second, MachSMT
deploys a modern ML pipeline, including an ensemble learning approach, namely
Adaptive Boosting [21].


6.2 Algorithm Selection for Logic Solvers and Their Applications 


Algorithm selection has a rich history, going back at least to 1976, when
Rice first proposed it [51]. Algorithm selectors have
been extensively used in many contexts, e.g., classifiers for machine learning [2], 
combinatorics [37], and other NP-hard optimization problems [60,62]. Within 
the context of solvers, algorithm selectors have been proposed for QBF [50,41], 
SAT [67,68,69], CSP solvers [25,3,34], and recommenders for ATP tools [59,61]. 

In the setting of SMT solver applications, symbolic execution tools have used 
algorithm selection strategies [64] and portfolio strategies [33] for the specific 
classes of instances within the context of the bit-vector theory. This would be 
an ideal use case of MachSMT, since we provide a more complete solution. 

There have been other works using machine learning to improve the perfor- 
mance of SMT solvers. Balunovic et al. [5] use neural networks and synthesis 
to find tactics and strategies for three SMT-LIB theories. A previous version of 
our work proposed an algorithm selection tool for the QF_FP theory [57]. To 
the best of our knowledge, MachSMT is the first publicly available tool for the 
entirety of SMT-LIB. Other works have leveraged machine learning to improve
internal heuristics in solvers [12,52,40].

Pairwise ranking has been used in algorithm selection in the latest versions 
of SATZilla [70], as well as in other settings such as variable selection in the 
context of splitting heuristics in divide-and-conquer parallel SAT solvers [45]. 


7 Conclusions and Future Work 


In this paper, we presented MachSMT, the first algorithm selection tool that 
spans the entirety of the SMT-LIB logics. MachSMT is designed to be user- 
friendly and easily modifiable by users for their specific application and SMT 
solvers of interest. 

Using MachSMT, we observe improvement in 54 out of 85 divisions in all
tracks from SMT-COMP 2019 and 2020, with up to a 198.4% improvement in PAR-2
score for the QF_BVFPLRA SQ '20 division. Most of the logics on which we do
not see improvement are ones for which we have very few benchmarks.
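For readers unfamiliar with the metric, PAR-2 is commonly defined as a penalized runtime in which every unsolved instance is charged twice the time limit. The sketch below sums the penalized runtimes; whether the values are summed or averaged is a reporting choice, and the function name and the use of `None` for unsolved instances are assumptions made here for illustration.

```python
from typing import Optional, Sequence


def par2_score(runtimes: Sequence[Optional[float]], timeout: float) -> float:
    """Sum of runtimes, charging 2 * timeout for unsolved or timed-out instances."""
    return sum(
        t if t is not None and t < timeout else 2 * timeout
        for t in runtimes
    )
```

Lower scores are better, so an algorithm selector improves the score whenever it avoids a solver that would have timed out.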

For future work, we plan to extend our scoring scheme to take into account 
model validation and unsat core divisions. We further plan to extend our feature 
set with more (theory-)specific features based on feedback from the SMT com- 
munity. It is very likely that users may have domain-specific knowledge about 
certain features that might be most predictive of solver runtime for their par- 
ticular application. Hence, we have provided an interface to easily extend and 
specialize MachSMT to a user’s specific setting. 
Abstract. Recent advances have shown how decision trees are apt data
structures for concisely representing strategies (or controllers) satisfying
various objectives. Moreover, they also make the strategy more explainable.
The recent tool dtControl provided pipelines with tools supporting strategy
synthesis for hybrid systems, such as SCOTS and Uppaal Stratego. We present
dtControl 2.0, a new version with several fundamentally novel features. Most
importantly, the user can now provide domain knowledge to be exploited in the
decision tree learning process and can also interactively steer the process
based on the dynamically provided information. To this end, we also provide a
graphical user interface. It allows for inspection and re-computation of parts
of the result, suggesting as well as receiving advice on predicates, and
visual simulation of the decision-making process. Besides, we interface model
checkers of probabilistic systems, namely STORM and PRISM, and provide
dedicated support for categorical enumeration-type state variables.
Consequently, the controllers are more explainable and smaller.


Keywords: Strategy representation · Controller representation · Decision
Tree · Explainable Learning · Hybrid systems · Probabilistic Model Checking ·
Markov Decision Process


1 Introduction 


A controller (also known as strategy, policy or scheduler) of a system assigns to
each state of the system a set of actions that should be taken in order to achieve a
certain goal. For example, one may want to satisfy a given specification of a robot’s
behaviour or exhibit a concurrency bug appearing only in some interleaving. It
is desirable that the controllers possess several additional properties, besides
achieving the goal, in order to be usable in practice. Firstly, controllers should
be explainable. Only then can they be understood, trusted and implemented
by the engineers, certified by the authorities, or used in the debugging process
[13]. Secondly, they should be small in size and efficient to run. Only then can
they be deployed on embedded devices with limited memory of a few kilobytes,
while the automatically synthesized ones are orders of magnitude larger [49].
Thirdly, whenever the primary goal, e.g. functional correctness, is accompanied 
by a secondary criterion, e.g. energy efficiency, they should be performant with 
respect to this criterion. 

Automatic controller synthesis is able to provide controllers for a given goal 
in various domains, such as probabilistic systems E], hybrid systems 
or reactive systems [85]. In some cases, even the performance can be 
reflected. However, despite recent interest in explainability in connection to
AI-based controllers [2] and despite typically small memories of embedded devices,
automatic techniques for controller synthesis mostly fall short of producing small 
explainable results. A typical outcome is a controller in the form of a look-up 
table, listing the actions for each possible state, or a binary decision diagram 
(BDD) representation thereof. While the latter reduces the size to some 
extent, none of the two representations is explainable: the former due to its size, 
the latter due to the bit-level representation with all high-level structure lost. 
Instead, learning representations in the form of decision trees (DTs) has
been recently explored to this end [3]. DTs usually turn out to be smaller
than BDDs, do not descend to the bit level, and are generally well known for
their interpretability and explainability due to their simple structure. However,
despite showing significant potential, the state-of-the-art tool dtControl [4] uses
predicates without natural interpretation, and moreover, the best size reductions 
are achieved using determinization, i.e. making the controller less permissive, 
which negatively affects performance [7]. 


Example 1 (Motivating example). Consider the cruise control model of [34], 
where we want to control the speed of our car so that it never crashes into the 
car in front while, as a secondary performance objective, keeping the distance 
between the two cars small. 

A safe controller for this model, as returned by Uppaal Stratego, is
a lookup table of size 418 MB with 300,000 lines. The respective BDD has 
1,448 nodes with all information bit-blasted. Using adaptations of standard 
DT-construction algorithms, as implemented in dtControl, we can get a DT 
with 987 nodes, which is still too large to be explained. Using determinization 
techniques, the controller can be compressed to 3 nodes! However, the DT then
only allows decelerating until the minimum velocity is reached. This is safe,
as we cannot crash into the car in front, but it does not even attempt to get
close to the front car, and thus performs very badly.

One can find a strategy with optimal performance, retaining the maximal
permissiveness, not determinizing at all, which can be represented by a DT with 11
nodes. A picture of this DT as well as reasoning how to derive the predicates from
the kinematic equations is in the extended version of this paper [5, Appendix A].

However, exactly because the predicates are based on the domain knowledge,
namely the kinematic equations, they take the form of algebraic predicates and
not simply linear predicates, which are the only ones in dtControl and commonly
in the machine-learning literature on DTs. △


This motivating example shows that using domain knowledge and algebraic 
predicates, available now in dtControl 2.0, one can get smaller representation 
than when using existing heuristics. Further, it improves the performance of the 
DT, and it is easily explainable, as it is based on domain knowledge. In fact, 
the discussed controller is so explainable that it allowed us to find a bug in the 
original model. In general, using dtControl 2.0, a domain expert can try to
compress the controller and thereby gain more insight and validate that it is correct.
Another example of this has been reported from the use of dtControl in the 
manufacturing domain B1. 

While automatic synthesis of good predicates from the domain knowledge may 
seem as distant as automatic synthesis of program invariants or automatic theorem 
provers, we adopt the philosophy of those domains and offer semi-automatic 
techniques. 

Additionally, if not performance but only safety of a controller is relevant, 
we can still benefit from determinization without drawbacks. To this end, we 
also provide a new determinization procedure that generalizes the extremely 
successful MaxFreq technique of [4] and is as good or better on all our examples.

To incorporate the changes just discussed, namely algebraic predicates, semi- 
automatic approach, and better determinization, we have also reworked the 
tool and its interfaces. To begin with, the software architecture of dtControl 
2.0 is now very modular and allows for easy further modifications, as well as 
adding support for new synthesis tools. In fact, we have already added parsers 
for the tools STORM and PRISM [B32], and thus we support probabilistic 
models as well. Since these models also contain categorical (or enumeration- 
type) variables, e.g. protocol states, we have also added support for categorical 
predicates. Furthermore, we added a graphical user interface that not only is
easier to use than the command-line interface, but also allows the user to
inspect the DT, modify and retrain parts of it, and simulate runs of the model
under its control, further increasing the possibilities to explain the DT and
validate the controller.

Summing up, the main improvements of dtControl 2.0 over the previous 
version [4] are the following:


— Support of algebraic predicates and categorical predicates 

— Semi-automatic interface and GUI with several interactive modes 

— New determinization procedure 

— Interfaces for model checkers PRISM and Storm and experimental evidence 
of improvements on probabilistic models compared to BDD 


The paper is structured as follows. After recalling necessary background in
Section 2, we give an overview of the improvements over the previous version of
the tool from the global perspective in Section 3. We detail the algorithmic
contributions in Sections 4 (predicate domains), 5 (predicate selection) and 6
(determinization). Section 7 provides experimental evaluation and Section 8
concludes.


Related work. DTs have been suggested for representing controllers of and
counterexamples in probabilistic systems in 0], however, the authors only discuss
approximate representations. The ideas have been extended to other settings, such
as reactive synthesis and hybrid systems [7]. More general linear predicates
have been considered in leaves of the trees in [8]. dtControl 2.0 contains the
DT induction algorithms from [3]. The differences to the previous version
of the tool dtControl are summarized above and schematically depicted in
Figure 2.

Besides, DTs have been used to represent and learn strategies for safety
objectives in and to learn program invariants in [21]. Further, DTs were
used for representing the strategies during the model checking process, namely
in strategy iteration or in simulation-based algorithms [42]. Representing
controllers exactly using a structure similar to DT (mistakenly claimed to be an
algebraic decision diagram) was first suggested by [23]; however, no automatic
construction algorithm was provided.

The idea of non-linear predicates has been explored in [28]. In that work, 
however, it is not based on domain knowledge, but rather on projecting the 
state-space to higher dimensions. 

BDDs have been commonly used to represent strategies in planning [15],
symbolic model checking as well as to represent hybrid system controllers
[30]. While BDDs operate only on Boolean variables, they have the
advantage of being diagrams and not trees. Moreover, they correspond to Boolean
functions that can be implemented on hardware easily. proposes an automatic 
compression technique for numerical controllers using BDDs. Similar to our work, 
considers the problem of obtaining concise BDD representation of controllers 
and presents a technique to obtain smaller BDDs via determinization. However, 
BDDs are difficult to explain due to variables being bit-blasted and their size is 
very sensitive to the chosen variable ordering. An extension of BDDs, algebraic 
or multi-terminal decision diagrams (ADD/MTBDD) [20], have been used in 
reinforcement learning for strategy synthesis [47]. ADDs extend BDDs with 
the possibility to have multiple values in the terminal nodes, but the predicates 
still work only on Boolean variables, retaining the disadvantages of BDDs.


2 Decision tree learning for controller representation 


In this section, we briefly describe how controllers can be represented as decision 
trees as in gl. We give an exemplified overview of the method, pinpointing the 
role of our algorithmic contributions. 

A (non-deterministic, also called permissive) controller is a map
$C : S \to 2^A$ from states to non-empty sets of actions. This notion of a
controller is fairly general; the only requirement is that it has to be
memoryless and non-randomized. Controllers of this kind are optimal for many
tasks such as expected (discounted) reward, reachability or parity objectives.
Moreover, even finite-memory controllers
can be written in this form by considering the product of the state space with 
the finite memory as the domain, for example, like in LTL model checking. 

Decision trees (DT), e.g. [38], are trees where every leaf node is labelled with
a non-empty set of actions and every inner node is labelled with a predicate
$p : S \to \{\mathit{true}, \mathit{false}\}$.


(a) Lookup table

v_o  v_f  d   actions
0    0    5   {neu}
2    6    10  {dec, neu, acc}
2    6    15  {dec, neu, acc}
4    4    15  {dec, neu}

(b) Decision tree

v_o > 0
  true:  v_f > 4
    true:  {dec, neu, acc}
    false: {dec, neu}
  false: {neu}


Fig. 1: An example controller based on the cruise-control model in the form of a lookup 
table (left), and the corresponding decision tree (right). 


Example 2 (Decision tree representation). As an example, consider the controller
given in Figure 1a. It is a subset of the real cruise-control case study from the
motivating Example 1. A state is a 3-tuple of the variables $v_o$, $v_f$ and $d$,
which denote the velocity of our car, the velocity of the front car and the
distance between the cars, respectively. In each state, our car may be allowed to
perform a subset of the following set of actions: decelerate (dec), stay in
neutral (neu) or accelerate (acc). A DT representing this lookup table is
depicted in Figure 1b.

Given a state, for example $v_o = v_f = 4$, $d = 10$, the DT is evaluated as
follows: We start at the root and, since it is an inner node, we evaluate its
predicate $v_o > 0$. As this is true, we follow the true branch and reach the
inner node labelled with the predicate $v_f > 4$. This is false, so we follow
the false branch and reach the leaf node labelled {dec, neu}. Hence, we know that
decelerating and staying neutral are allowed by the controller in this state,
whereas accelerating is not. △
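The evaluation procedure of Example 2 can be sketched in a few lines. The class names `Leaf` and `Inner`, the dictionary-based state, and the constant `TREE` are illustrative assumptions and not dtControl's internal data structures; the tree itself is the one that represents the lookup table of Fig. 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Union

State = Dict[str, float]


@dataclass(frozen=True)
class Leaf:
    actions: FrozenSet[str]  # non-empty set of allowed actions


@dataclass(frozen=True)
class Inner:
    label: str                          # human-readable predicate
    predicate: Callable[[State], bool]  # p : S -> {true, false}
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Inner]


def evaluate(node: Node, state: State) -> FrozenSet[str]:
    """Walk from the root to a leaf, following the predicate outcomes."""
    while isinstance(node, Inner):
        node = node.if_true if node.predicate(state) else node.if_false
    return node.actions


# The decision tree for the lookup table of Fig. 1.
TREE: Node = Inner(
    "v_o > 0", lambda s: s["v_o"] > 0,
    Inner(
        "v_f > 4", lambda s: s["v_f"] > 4,
        Leaf(frozenset({"dec", "neu", "acc"})),
        Leaf(frozenset({"dec", "neu"})),
    ),
    Leaf(frozenset({"neu"})),
)
```

Calling `evaluate(TREE, {"v_o": 4, "v_f": 4, "d": 10})` follows the true branch at the root and the false branch at the second node, ending in the leaf {dec, neu}.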


To construct a DT representation of a given controller, the following recursive
algorithm may be used. Note that it is heuristic since constructing an optimal
binary decision tree is an NP-complete problem [27].


Base case: If all states in the controller agree on their set of actions B (i.e.
for all states s we have C(s) = B), return a leaf node with label B.

Recursive case: Otherwise, we split the controller. For this, we select a predicate
p and construct an inner node with label p. Then we partition the controller
by evaluating the predicate on the state space, and recursively construct one
DT for the sub-controller on states $\{s \in S \mid p(s)\}$ where the predicate is
true, and one for the sub-controller where it is false. These DTs are
the children of the inner node with label p.


For selecting the predicate, we consider two hyper-parameters: the domain
of the predicates (see Section 4) and the way to select predicates (see Section 5).
The selection is typically performed by choosing the predicate with the lowest
impurity; this is a measure of how homogeneous (or “pure”) the controller is after
the split, in other words the degree to which all the states agree on their actions.
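The recursive construction with impurity-based predicate selection can be sketched as follows. This is a simplified illustration, not dtControl's implementation: it assumes a finite controller given as a dictionary, only axis-aligned candidate predicates of the form `s[i] <= c` (the non-strict comparison is a choice made here), and entropy over whole action sets as the impurity.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, Tuple

State = Tuple[float, ...]
Controller = Dict[State, FrozenSet[str]]


def entropy(ctrl: Controller) -> float:
    """Shannon entropy of the action sets, each set treated as one label."""
    n = len(ctrl)
    return -sum(k / n * math.log2(k / n) for k in Counter(ctrl.values()).values())


def split(ctrl: Controller, i: int, c: float) -> Tuple[Controller, Controller]:
    yes = {s: a for s, a in ctrl.items() if s[i] <= c}
    no = {s: a for s, a in ctrl.items() if s[i] > c}
    return yes, no


def impurity(ctrl: Controller, i: int, c: float) -> float:
    """Weighted entropy of the two sub-controllers after the split."""
    return sum(len(part) / len(ctrl) * entropy(part) for part in split(ctrl, i, c))


def build(ctrl: Controller):
    labels = set(ctrl.values())
    if len(labels) == 1:
        return labels.pop()  # base case: all states agree, return a leaf
    dims = len(next(iter(ctrl)))
    # Thresholds exclude the maximum, so both sides of a split are non-empty.
    candidates = [(i, c) for i in range(dims)
                  for c in sorted({s[i] for s in ctrl})[:-1]]
    i, c = min(candidates, key=lambda p: impurity(ctrl, *p))
    yes, no = split(ctrl, i, c)
    return (i, c, build(yes), build(no))  # inner node with its two children


def lookup(tree, state: State) -> FrozenSet[str]:
    while isinstance(tree, tuple):
        i, c, yes, no = tree
        tree = yes if state[i] <= c else no
    return tree
```

Because every split separates at least one state from the rest, the recursion terminates, and the resulting tree reproduces the controller exactly on all of its states.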

We also consider a third hyper-parameter of the algorithm, namely
determinization by safe early stopping (see Section 6). This modifies the base
case as follows: if all states in the controller agree on at least one action a
(i.e. for all states s we have $a \in C(s)$), then we return a leaf node with
label {a}. This variant of early stopping ensures that, even though the
controller is not represented exactly, still for every state a safe action is
allowed.
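The modified base case amounts to intersecting the action sets of all states. In the sketch below, the function name `safe_leaf` and the alphabetical tie-break among several shared actions are assumptions made for illustration; the determinization procedures of the tool may choose the action differently.

```python
from typing import Dict, FrozenSet, Optional, Tuple

State = Tuple[float, ...]
Controller = Dict[State, FrozenSet[str]]


def safe_leaf(ctrl: Controller) -> Optional[FrozenSet[str]]:
    """Return {a} for an action a allowed in every state, or None if none exists."""
    shared = frozenset.intersection(*ctrl.values())
    if not shared:
        return None  # no common action: the recursive case must split further
    # Arbitrary but deterministic choice among the shared actions (assumption).
    return frozenset({min(shared)})
```

For the lookup table of Fig. 1, every state allows neu, so safe early stopping would already return the single leaf {neu} at the root.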

Hence, if the original controller satisfies some property, e.g. that a safe set of 
states is never left, the DT construction algorithm ensures that this property is 
retained. This is because our algorithm represents the strategy exactly (or a safe 
subset, in case of determinization) and does not generalize as DTs typically do in 
machine learning. DTs are suitable for both tasks, as both rely on the strength 
of DTs exploiting underlying structure. 


Remark 1. Note that for some types of objectives such as reachability, deter- 
minization of permissive strategies might lead to a violation of the original 
guarantees. For example, consider a strategy that allows both a self-looping and 
a non-self-looping action at a particular state. If the determinizer decides to 
restrict to the self-looping action, the reachability property may be violated in the 
determinized strategy. However, this problem can be addressed when synthesizing 
the strategy by ensuring that every action makes progress towards the target. 


3 Tool 


dtControl 2.0 is an easy-to-use open-source tool for representing memoryless
symbolic controllers as more compact and more interpretable DTs, while retaining
safety guarantees of the original controllers. Our website
dtcontrol.model.in.tum.de offers hyperlinks to the easy-to-install pip package
(see footnote 3), the documentation and the source code. Additionally, the
artifact that has passed the TACAS 21 artifact evaluation is available at [6].

3 pip is a standard package-management system used to install and manage software
packages written in Python.

Fig. 2: An overview of the components of dtControl 2.0, thereby showing software
architecture and workflow. Contributions of this paper are highlighted in red.

The schema in Figure 2 illustrates the workflow of using dtControl,
highlighting new features in red. Considering dtControl as a black box, it shows
that given a controller, it returns a DT representing the controller and also
offers the possibility to simulate a run of the system under the control of the
DT, visualizing the decisions made. The controller can be input in various
formats, including the newly supported strategy representations of the
well-known probabilistic model checkers PRISM and STORM [17]. The DT is output
in several machine-readable formats, and as C code that can be directly used for
executing the controller on embedded devices. Note that this C code consists
only of nested if-else statements. The new graphical user interface also offers
the possibility to inspect the graph in an interactive web user interface, which
even allows the user to edit the DT. This means that parts of the DT can be
retrained with a different set of hyper-parameters and directly replaced. This
way, one can for example first train a determinized DT and then retrain
important parts of it to be more permissive and hence more performant for a
secondary criterion. Figure 3 shows a screenshot of the newly integrated
graphical user interface.
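To illustrate why a DT translates directly into nested if-else statements, the following sketch emits C source text from a small tree. The emitted function signature, the array-based state `x`, and the integer action codes in the leaves are assumptions for illustration; the C code actually produced by dtControl may look different.

```python
from typing import Tuple, Union

# A leaf is an integer action code; an inner node is (condition, true-child, false-child).
Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def emit(node: Tree, depth: int = 1) -> str:
    """Recursively render a subtree as nested C if-else statements."""
    pad = "    " * depth
    if isinstance(node, int):
        return f"{pad}return {node};\n"
    cond, if_true, if_false = node
    return (
        f"{pad}if ({cond}) {{\n"
        + emit(if_true, depth + 1)
        + f"{pad}}} else {{\n"
        + emit(if_false, depth + 1)
        + f"{pad}}}\n"
    )


def to_c(tree: Tree, name: str = "controller") -> str:
    """Wrap the nested if-else body in a C function definition."""
    return f"int {name}(const float x[]) {{\n" + emit(tree) + "}\n"
```

For example, `to_c(("x[0] > 0", ("x[1] > 4", 2, 1), 0))` yields a function with two nested `if` statements and three `return` statements, one per leaf.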

Looking at the inner workings of dtControl, we see the three important
hyper-parameters that were already introduced in Section 2: predicate domain,
predicate selector, and determinizer. For each of these, dtControl offers
various choices, some of which were newly added for version 2.0. Most
prominently, the user now has the possibility to directly influence both the
predicate domain and the predicate selector, by providing domain knowledge and
thus also additional predicates, or by directly using the interactive predicate
selection. More details on the predicate domain and how domain knowledge is
specified can be found in Section 4. The different ways to select predicates,
especially the new interactive mode, are the topic of Section 5. Our new
insights into determinization are described in Section 6. To support the user
in finding a good set of hyper-parameters, dtControl also offers extensive
benchmarking functionality, allowing the user to specify multiple variants and
reporting several statistics.

Fig. 3: Screenshot of the new web-based graphical user interface. It offers a sidebar for
easy selection of the controller file and hyper-parameters, an experiments table where
benchmarks can be queued, and a results table in which some statistics of the run are
provided. Moreover, users can click on the ‘eye’ icon in the results table to inspect the
built decision tree.


Technical notes. dtControl 2.0 is written in Python 3 following an
architecture closely resembling the schema in Figure 2. The modularity, along
with our technical documentation, allows users to easily extend the tool. For
example, supporting another input format is only a matter of adding a parser.

dtControl 2.0 works with Python version 3.7.9 or higher. The core of the
tool which runs the learning algorithms requires numpy 23], pandas and
scikit-learn ja] and optionally the library for the heuristic OC1 [39]. The
algebraic predicates rely on SymPy and SciPy jas]. The web user interface is
powered by Flask [1] and D3.js [9].


4 Predicate domain 


The domain of the predicates that we allow in the inner nodes of the DT is of key
importance. As we saw in the motivating Example 1, allowing for more expressive
predicates can dramatically reduce the size of the DT.

We assume that our state space is structured, i.e. it is a Cartesian product of
the domains of the variables ($S = S_1 \times \dots \times S_n$). We use $s_i$
to refer to the i-th state-variable of a state $s \in S$. In Example 2, the
three state-variables are the velocity of our car, the velocity of the front
car, and the distance.

We first give an overview of the predicate domains dtControl 2.0 supports,
before discussing the details of the new ones.

Axis-aligned predicates have the form $s_i < c$, where c is a rational constant.
This is the simplest form of predicate, and it has the advantage that there are
only finitely many such predicates, as the domain of every state-variable is
bounded. However, they are also the least expressive.

Linear predicates (also known as oblique [39]) have the form
$\sum_i s_i \cdot a_i < c$, where $a_i$ are rational coefficients and c is a
rational constant. They have the advantage that they are able to combine several
state-variables, which can lead to saving linearly many splits, cf. Fig. 5.2].
The disadvantage of these predicates is that there are infinitely many choices
of coefficients, which is why heuristics were introduced to determine a good set
of predicates to try out gl. However, heuristically determined coefficients and
combinations of variables can impede explainability.

Algebraic predicates have the form $f(s) < c$, where f is any mathematical
function over the state-variables and c is a rational constant. It can use
elementary functions such as exponentiation, log, or even trigonometric
functions. Example 1 illustrated how this can reduce the size and improve
explainability. More discussion of these predicates follows in Section 4.2.

Categorical predicates are special predicates for categorical
(enumeration-type) state-variables such as colour or protocol state, and they
are discussed in Section 4.1.
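The four predicate domains listed above can be pictured as small factory functions. The names below are hypothetical and chosen for illustration, not taken from dtControl; the first three return Boolean predicates over a numeric state tuple, while the categorical one returns a branch index because its split may be non-binary.

```python
import math
from typing import Callable, Sequence

State = Sequence[float]
Predicate = Callable[[State], bool]


def axis_aligned(i: int, c: float) -> Predicate:
    """s_i < c: compares a single state-variable with a constant."""
    return lambda s: s[i] < c


def linear(a: Sequence[float], c: float) -> Predicate:
    """sum_i s_i * a_i < c: combines several state-variables linearly."""
    return lambda s: sum(ai * si for ai, si in zip(a, s)) < c


def algebraic(f: Callable[[State], float], c: float) -> Predicate:
    """f(s) < c for an arbitrary function f supplied as domain knowledge."""
    return lambda s: f(s) < c


def categorical(i: int, groups):
    """Non-binary split: index of the value group that contains s[i]."""
    return lambda s: next(k for k, g in enumerate(groups) if s[i] in g)
```

Each step from axis-aligned to linear to algebraic enlarges the set of expressible predicates, which is what allows smaller trees at the price of a larger search space.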


4.1 Categorical predicates 


Categorical state-variables do not have a numeric domain, but instead are un- 
ordered and qualitative. They commonly occur in the models coming from the 
tools PRISM and STORM. 


Example 3. Let one state-variable be ‘colour’ with the domain {red, blue, green}.
A simple approach is to assign numbers to every value, e.g. red = 0, blue = 1,
green = 2, and treat this variable as numeric. However, a resulting predicate
such as colour < 2 is hardly explainable and additionally depends on the
assignment of numbers. For example, it would not be possible to single out
$colour \in \{red, green\}$ using a single predicate, given the aforementioned
numeric assignment. Using linear predicates, for example adding half of the
colour to some other state-variable, is even more confusing and dependent on the
numeric assignment. △


Instead of treating the categorical variables using their numeric encodings,
dtControl 2.0 supports specialized algorithms from the literature, see e.g. [44].
They work by labelling an inner node with a categorical variable and performing
a (possibly non-binary) split according to the value of the categorical variable.
The node can have at most one child for every possible value of the categorical
variable, but it can also group together similarly behaving values, see Figure 4 for
an example. For the grouping, dtControl 2.0 uses the greedy algorithm from
[44, Chapter 7] called attribute-value grouping. It proceeds by first considering
a branch for every single possible value of the categorical variable, and then
merging branches as long as this improves the predicate; see [5, Appendix C] for
the full pseudocode of the algorithm.

In our experiments we found that the grouping algorithm sometimes did not
merge branches in cases where it would actually have made the DT smaller or more
explainable. This is because the resulting impurity, which measures the
goodness of a predicate, could be marginally worse due to floating-point
inaccuracies. Thus, we introduce tolerance, a bias parameter in favour of
larger value groups. When checking whether to merge branches, we do not require
the impurity to improve, but we allow it to become worse by up to our tolerance.
Setting tolerance to 0 corresponds exactly to the algorithm from [44], while
setting tolerance to $\infty$ results in merging branches until only two
remain, thus producing binary predicates.
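A greedy grouping loop with a tolerance parameter might look as follows. This is only a sketch under stated assumptions: the pseudocode of the actual algorithm is in [5, Appendix C] and is not reproduced here, the impurity is plain weighted entropy, and a merge is accepted whenever the new impurity is at most the current one plus the tolerance.

```python
import math
from collections import Counter
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

Sample = Tuple[str, str]  # (categorical value, label)


def entropy(labels: Sequence[str]) -> float:
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())


def impurity(groups: List[FrozenSet[str]], data: Sequence[Sample]) -> float:
    """Weighted entropy over the branches induced by the value groups."""
    total = 0.0
    for group in groups:
        labels = [lab for val, lab in data if val in group]
        if labels:
            total += len(labels) / len(data) * entropy(labels)
    return total


def group_values(data: Sequence[Sample], tolerance: float = 0.0) -> List[FrozenSet[str]]:
    # Start with one branch per value of the categorical variable.
    groups = [frozenset({v}) for v in sorted({val for val, _ in data})]
    current = impurity(groups, data)
    while len(groups) > 2:
        # Try every pairwise merge and keep the one with the lowest impurity.
        best = min(
            ([g for g in groups if g not in (a, b)] + [a | b]
             for a, b in combinations(groups, 2)),
            key=lambda gs: impurity(gs, data),
        )
        score = impurity(best, data)
        if score > current + tolerance:
            break  # merging would worsen the predicate by more than allowed
        groups, current = best, score
    return groups
```

With tolerance 0, only merges that do not worsen the impurity are performed, such as joining red and green when they behave identically; with an infinite tolerance, merging continues until two groups remain.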

To allow dtControl 2.0 to use categorical predicates, the user has to provide 
a metadata file, which tells the tool which variables are categorical and which 
are numeric; see [5, Appendix B.1] for an example.


4.2 Algebraic predicates 


It is impossible to try out every mathematical expression over the state-variables, 
and it would also not necessarily result in an explainable DT. Instead, we allow 
the user to enter domain knowledge to suggest templates of predicates that 
dtControl 2.0 should try. See [5, Appendix B.2] for a discussion of the format
in which domain knowledge can be entered. 

Providing the basic equations that govern the model behaviour can already 
help in finding a good predicate, and is easy to do for a domain expert. Addition- 
ally, dtControl 2.0 offers several possibilities to further exploit the provided 
domain knowledge: 

Firstly, the given predicates need not be exact, but may contain coefficients.
These coefficients can either be completely arbitrary or come from a finite
set suggested by the user. For coefficients with finite domain, dtControl 2.0
tries all possibilities; for arbitrary coefficients, it uses curve fitting to
find a good value. For example, the user can specify a predicate such as
$d + (v_o - v_f) + c_0 > c_1$ with $c_0$ being an arbitrary rational number and
$c_1 \in \{0, 5, 10\}$.

Fig. 4: Two examples of a categorical split. On the left, all possible values of the
state-variable colour lead to a different child in a non-binary split. On the right, red and
green lead to the same child, which is a result of grouping similar values together.
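The finite-domain case of the coefficient search described above can be sketched by enumerating all coefficient combinations of a predicate template and keeping the one with the lowest impurity. The names `Template`, `TEMPLATE` and `best_coefficients` are hypothetical, both coefficients are given finite domains for simplicity, and the curve fitting used for arbitrary coefficients is not covered.

```python
import math
from collections import Counter
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, float]
Sample = Tuple[State, str]  # (state, label standing for its action set)
Template = Callable[[State, Dict[str, float]], bool]


def entropy(labels: Sequence[str]) -> float:
    n = len(labels)
    return -sum(k / n * math.log2(k / n) for k in Counter(labels).values())


def impurity(pred: Callable[[State], bool], data: Sequence[Sample]) -> float:
    """Weighted entropy of the labels on the two sides of the predicate."""
    parts: Dict[bool, List[str]] = {True: [], False: []}
    for state, label in data:
        parts[pred(state)].append(label)
    return sum(len(p) / len(data) * entropy(p) for p in parts.values() if p)


def best_coefficients(template: Template,
                      domains: Dict[str, Sequence[float]],
                      data: Sequence[Sample]) -> Dict[str, float]:
    """Try every combination from the finite domains; return the best one."""
    names = sorted(domains)
    candidates = (dict(zip(names, values))
                  for values in product(*(domains[n] for n in names)))
    return min(candidates,
               key=lambda c: impurity(lambda s: template(s, c), data))


# A template shaped like the example predicate in the text.
TEMPLATE: Template = lambda s, c: s["d"] + (s["v_o"] - s["v_f"]) + c["c0"] > c["c1"]
```

The number of candidates is the product of the domain sizes, so exhaustive enumeration is only practical for the small coefficient sets a user would typically suggest.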

Secondly, the interactive predicate selection (see Section 5) allows the user
to try out various predicates at once and observe their respective impurity in 
the current node. The user can then choose among them as well as iteratively 
suggest further predicates, inspired by those where the most promising results 
were observed. 

Thirdly, the decisions given by a DT can be visualized in the simulator, 
possibly leading to a better understanding of the controller. Upon gaining any further 
insight, the user can directly edit any subtree of the result, possibly utilizing the 
interactive predicate selection again. 


5 Predicate selection 


The tool offers a range of options to affect the selection of the most appropriate 
predicate from a given domain. 


Impurity measures: As mentioned in Section 2, the predicate selection is typically 
based on the lowest impurity induced. The most commonly used impurity measure 
(and the only one the first version of dtControl supported) is Shannon's 
entropy [46]. In dtControl 2.0, a number of other impurity measures from the 
literature [43] are available. However, our results indicate that 
entropy typically performs the best, and therefore it is used as the default option 
unless the user specifies otherwise. Due to lack of space, we defer the details 
and experimental comparison between the impurity measures to [5, Appendix D]. 


Priorities: dtControl 2.0 also has the new functionality to assign priorities to 
the predicate generating algorithms. Priorities are rational numbers between 0 
and 1. The impurity of every predicate is divided by the priority of the algorithm 
that generated it. For example, a user can use axis-aligned splits with priority 
1 and a linear heuristic with priority 1/2. Then the more complicated linear 
predicate is only chosen if it is at least twice as good (in terms of impurity) as 
the easier-to-understand axis-aligned split. A predicate with priority 0 is only 
considered after all predicates with non-zero priority have failed to split the data. 
This allows the user to give just a few predicates from domain knowledge, which 
are then strictly preferred to the automatically generated ones, but which need 
not suffice to construct a complete DT for the controller. 
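
The priority rule can be illustrated with a short Python sketch. The function name, the tuple layout and the impurity numbers below are hypothetical and are not taken from the dtControl code base; only the rule itself (divide the impurity by the priority, and fall back to priority-0 predicates when nothing else splits the data) follows the description above.

```python
def select_predicate(candidates):
    """Pick a predicate from (name, impurity, priority, splits_data) tuples.

    Hypothetical helper: lower impurity is better, the impurity is divided
    by the priority, and priority-0 predicates are only a fallback.
    """
    prioritized = [c for c in candidates if c[2] > 0 and c[3]]
    if prioritized:
        # Compare predicates by their priority-adjusted impurity.
        return min(prioritized, key=lambda c: c[1] / c[2])[0]
    # No predicate with non-zero priority splits the data: use priority 0.
    fallback = [c for c in candidates if c[2] == 0 and c[3]]
    return min(fallback, key=lambda c: c[1])[0] if fallback else None


# Axis-aligned split with priority 1, linear heuristic with priority 1/2.
print(select_predicate([("axis", 0.5, 1.0, True), ("linear", 0.3, 0.5, True)]))
print(select_predicate([("axis", 0.5, 1.0, True), ("linear", 0.2, 0.5, True)]))
```

In the first call the linear predicate is less than twice as good (0.3 / 0.5 = 0.6 > 0.5), so the axis-aligned split is kept; in the second call it is more than twice as good and is selected.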


Interactive predicate selection: dtControl 2.0 offers the user the possibility 
to manually select the predicate in every split. This way, the user can prefer 
predicates that are explainable over those that optimize the impurity. 

The screenshot of the interactive interface in [5, Appendix F] shows the 
information that dtControl 2.0 provides. The user is given some statistics 
and metadata, e.g. minimum, maximum and step size of the state-variables in 
the current node, a few automatically generated predicates for reference and all 
predicates generated from domain knowledge. The user can specify new predicates 
and is immediately informed about their impurity. Upon selecting a predicate, 
the split is performed and the user continues in the next node. 

The user can also first construct a DT using some automatic algorithm 
and then restart the construction from an arbitrary node using the interactive 
predicate selection to handcraft an optimized representation, or at any point 
decide that the rest of the DT should be constructed automatically. 


6 New insights about determinization 


In our context, determinization denotes a procedure that, for some or all states, 
picks a subset of the allowed actions. Formally, a determinization function $\delta$ 
transforms a controller $C$ into a "more determinized" $C'$, such that for all states 
$s \in C$ we have $\emptyset \subsetneq C'(s) \subseteq C(s)$. This reduces the permissiveness, but often 
also reduces the size. Note that, for safety controllers, this always preserves 
the original guarantees of the controller. For other (non-safety) controllers, see 
Remark 1. 

dtControl 2.0 supports three different general approaches to determinizing a 
controller: pre-processing, post-processing and safe early stopping. Pre-processing 
commits to a single determinization before constructing the DT. Post-processing 
prunes the DT after its construction, e.g. safe pruning in [7]. The basic idea of 
safe early stopping is already described in Section 2: if all states agree on at 
least one action, then instead of continuing to split the controller, stop early 
and return a leaf node with that common action. Alternatively, to preserve more 
permissiveness, one can return not only a single common action, but all common 
actions; formally, return the maximum set $B$ such that for all states $s$ in the 
node $B \subseteq C(s)$. 

The results of [4] show that both pre-processing and post-processing are 
outperformed by an on-the-fly approach based on safe early stopping. This is 
because pre-processing discards a lot of information that could have been useful 
in the DT construction and post-processing can only affect the bottom-most 
nodes of the resulting DT, but usually not those close to the root. 

We now give a new view on safe early stopping approaches for determinizing 
a controller that allows us to generalize the techniques of [4], reducing the size of 
the resulting DTs even more. 


Example 4. Consider the following controller: $C(s_1) = \{a, b, c\}$, $C(s_2) = \{a, b, d\}$, 
$C(s_3) = \{x, y\}$. All three states map to different sets of actions, and thus an 
impurity measure like entropy penalizes grouping $s_1$ and $s_2$ the same as grouping 
$s_1$ and $s_3$. However, if determinization is allowed, grouping $s_1$ and $s_2$ need not be 
penalized at all, as these states agree on some actions, namely $a$ and $b$. Grouping 
$s_1$ and $s_2$ into the same child node thus allows the algorithm to stop early at that 
point and return a leaf node with $\{a, b\}$, in contrast to grouping $s_1$ and $s_3$. △ 
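
The observation of Example 4 can be checked with a minimal Python sketch. The helper names are hypothetical; the entropy used here is plain Shannon entropy in which every distinct action set counts as its own label, and the shared actions are the maximum set $B$ with $B \subseteq C(s)$ for all states in the group.

```python
from collections import Counter
from math import log2

# Controller of Example 4: each state maps to its set of allowed actions.
C = {"s1": frozenset("abc"), "s2": frozenset("abd"), "s3": frozenset("xy")}


def set_label_entropy(states):
    """Shannon entropy when every distinct action set is its own label."""
    counts = Counter(C[s] for s in states)
    n = len(states)
    return -sum(k / n * log2(k / n) for k in counts.values())


def shared_actions(states):
    """Largest set of actions allowed in every given state."""
    return frozenset.intersection(*(C[s] for s in states))


# Entropy cannot tell the two groupings apart ...
print(set_label_entropy(["s1", "s2"]), set_label_entropy(["s1", "s3"]))
# ... but only the first grouping admits safe early stopping.
print(sorted(shared_actions(["s1", "s2"])), sorted(shared_actions(["s1", "s3"])))
```

Both groupings have the same entropy, yet only the group containing s1 and s2 has a non-empty set of shared actions and can be turned into a leaf.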


Knowing that we want to determinize by safe early stopping affects the 
predicate selection process. Intuitively, sets of states are more homogeneous the 
more actions they share. We want to take this into account when calculating the 
impurity of predicates. One way to do this would be to calculate the impurity of 
all possible determinization functions and pick the best one. This, however, is 
infeasible, hence we propose the heuristic of multi-label impurity measures. These 
impurity measures do not consider only the full set of allowed actions in their 
calculation, but also depend on the individual actions occurring in the 
set. This allows the DT construction to pick better predicates, namely those 
whose resulting children are more likely to be determinizable. In [5, Appendix E] 
we formally derive the multi-label variants of entropy and Gini-index. 

To conclude this section, we point out the key difference between the new 
approach of multi-label impurity measures and the previous idea that was introduced 
in [4]. The approach from [4] does not evaluate the impurity of all possible 
determinization functions, but rather picks a smart one, that of maximum 
frequency (MaxFreq), and evaluates according to that. MaxFreq determinizes in 
the following way: for every state, it selects from the allowed actions that action 
occurring most frequently throughout the whole controller. This way, many states 
share common actions. This is already better than pre-processing, as it does not 
determinize the controller a priori, but rather considers a different determinization 
function at every node. However, in every node we calculate the impurity for 
several different predicates, and the optimal choice of determinization function 
depends on the predicate. Thus, choosing a single determinization function for 
a whole node is still too coarse, as it is fixed independent of the considered 
predicate. We illustrate the arising problem in the following Example 5. 


Fig. 5: A simple example of a dataset that is split suboptimally by the MaxFreq approach 
from [4] but optimally by the new multi-label entropy approach. 


Example 5. Figure 5 shows a simple controller with a two-dimensional state 
space. Every point is labeled with its set of allowed actions. 

As c is the most frequent action, MaxFreq determinizes the states (1,2), 
(1,3), (2,2) and (2,3) to action c. Hence the red split (predicate y < 1.5) is 
considered optimal, as it groups together all four states that map to c. The blue 
split (predicate x < 1.5) is considered suboptimal, as then the data still looks 
very heterogeneous. So, using MaxFreq, we need two splits for this controller: 
one to split off all the c's and one to split the two remaining states. 

However, it is better to first choose a predicate and then determine a fitting 
determinization function. When calculating the impurity of the blue split, we can 
choose to determinize all states with x = 1 to {a} and all states with x = 2 to 
{b}. Thus, in both resulting sub-controllers the impurity is 0 as all states agree on 
at least one action. This way, one split suffices to get a complete DT. Multi-label 
impurity measures notice when labels are shared between many (or all) states in 
a sub-controller, and thus they allow the construction to prefer the optimal blue split. △ 
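
The difference between the two approaches in Example 5 can be replayed in Python. The data set below is an assumption: it is one controller that is consistent with the textual description of Fig. 5, not a copy of the figure. All function names are hypothetical. Instead of an actual multi-label impurity measure, the sketch uses the simpler check "every child has a common action", which is the situation in which the example states that the impurity is 0.

```python
from collections import Counter

# Assumed data consistent with the description of Fig. 5 (state -> allowed actions).
C = {(1, 1): {"a"}, (2, 1): {"b"},
     (1, 2): {"a", "c"}, (1, 3): {"a", "c"},
     (2, 2): {"b", "c"}, (2, 3): {"b", "c"}}


def max_freq(controller):
    """MaxFreq: per state, keep the allowed action most frequent overall."""
    freq = Counter(a for acts in controller.values() for a in acts)
    return {s: max(sorted(acts), key=freq.__getitem__)
            for s, acts in controller.items()}


def children(controller, pred):
    """Split the states of a controller by a Boolean predicate."""
    yes = {s: v for s, v in controller.items() if pred(s)}
    no = {s: v for s, v in controller.items() if not pred(s)}
    return yes, no


def pure_after_maxfreq(controller, pred):
    """True if every child is pure under the fixed MaxFreq determinization."""
    det = max_freq(controller)
    return all(len({det[s] for s in child}) == 1
               for child in children(controller, pred))


def has_common_action(child):
    sets = list(child.values())
    return bool(sets) and bool(set.intersection(*sets))


def determinizable_per_predicate(controller, pred):
    """True if every child can be determinized to a single leaf."""
    return all(has_common_action(child) for child in children(controller, pred))


def red(s):   # predicate y < 1.5
    return s[1] < 1.5


def blue(s):  # predicate x < 1.5
    return s[0] < 1.5


print(pure_after_maxfreq(C, red), pure_after_maxfreq(C, blue))
print(determinizable_per_predicate(C, red), determinizable_per_predicate(C, blue))
```

With the fixed MaxFreq determinization neither split finishes the tree in one step, whereas choosing the determinization after the predicate shows that the blue split alone suffices.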


7 Experiments 


Experimental setup. We compare three approaches: BDDs, the first version of 
dtControl from [4] and dtControl 2.0. For BDDs, the variable ordering is 
important, so we report the smallest of 20 BDDs that we constructed by starting 
with a random initial variable ordering and reordering until convergence. To 
determinize BDDs, we used the pre-processing approach, 10 times with the minimum 
norm and 10 times with MaxFreq. For the previous version of dtControl, 
we picked the smaller of either a DT with only axis-aligned predicates or a DT 
with linear predicates using the logistic regression heuristic that was typically 
best in [4]. Determinization uses safe early stopping with the MaxFreq approach. 
For dtControl 2.0, we use the multi-label entropy based determinization and 
utilize the categorical predicates for the case studies from probabilistic model 
checking. We ran all experiments on a server with operating system Ubuntu 
19.10, a 2.2GHz Intel(R) Xeon(R) CPU E5-2630 v4 and 250 GB RAM. 


Comparing determinization techniques on cyber-physical systems. Table 1 shows 
the sizes of determinized BDDs and DTs on the permissive controllers of the 
tools SCOTS and Uppaal Stratego that were already used in [4]. We see that the 
new determinization approach never produces a larger DT than the previous one 
and is strictly better in all but two cases, where the result of the previous 
method was already optimal. With the exception of the case studies helicopter and 
truck_trailer where BDDs are comparable or slightly better, both approaches using 
DTs are orders of magnitude smaller than BDDs or an explicit representation of the 
state-action mapping. 


Case studies from probabilistic model checking. For Table 2 we used case studies 
from the quantitative verification benchmark set [24], which includes models from 
the PRISM benchmark suite [83]. Note that these case studies contain unordered 
enumeration-type state-variables for which we utilize the new categorical predicates. 
To get the controllers, we solved the case study with STORM and exported 
the resulting controller. This export already eliminates unreachable states. The 
previous version of dtControl was not able to handle these case studies, so we 
only compare dtControl 2.0 to BDDs. 

Footnote on BDDs: Our implementation of BDDs is based on the dd Python library, 
https://github.com/tulip-control/dd 

Table 1: Controller sizes of different determinized representations of the controllers 
from SCOTS and Uppaal Stratego. “States” is the number of states in the controller, 
“BDD” the number of nodes of the smallest BDD from 20 tries, dtControl 1.0 the 
smallest DT the previous version of dtControl could generate and dtControl 2.0 the 
smallest DT the new version can construct. “TO” denotes a failure to produce a result 
in 3 hours. The smallest numbers in each row are highlighted. 

Case study       States        BDD      dtControl 1.0   dtControl 2.0 
cartpole         271           127      11              7 
10rooms          26,244        128      7               7 
helicopter       280,539       870      221             123 
cruise-latest    295,615       1,448    3               3 
dcdc             593,089       381      9               5 
truck_trailer    1,386,211     18,186   42,561          31,499 
traffic_30m      16,639,662    TO       127             97 

Table 2 shows that also for case studies from probabilistic model checking, DTs 
are a good way of representing controllers. The DT is the smallest representation 
on 14 out of 19 case studies, often reducing the size by an order of magnitude 
compared to BDDs or the explicit representation. On 3 case studies, BDDs are 
smallest, and on 2 case studies, both the DT and the BDD fail to reduce the size 
compared to the explicit representation. This happens if there are many different 
actions and thus states cannot be grouped together. A worst case example of this 
is a model where every state has a different action; then, a DT would have as 
many leaf nodes as there are states, and hence roughly twice as many nodes in total 
(a binary tree with n leaves has 2n − 1 nodes). 


Remark 2. Note that the controllers exported by STORM are deterministic, so no 
determinization approach can be utilized in the DT construction. We conjecture 
that if a permissive strategy were exported, dtControl 2.0 would benefit from 
the additional information and be able to reduce the controller size further, as for 
the cyber-physical systems. 


8 Conclusion 


We have presented a radically new version of the tool dtControl for representing 
controllers by decision trees. The tool now features a graphical user interface, 
allowing both experts and non-experts to conveniently interact with the decision 
tree learning process as well as the resulting tree. There is now a range of 
possibilities on how the user can provide additional information. The algebraic 
predicates provide the means to capture the (often non-linear) relationships from 
the domain knowledge. The categorical predicates together with the interface 
to probabilistic model checkers allow for efficient representation of strategies for 
Markov decision processes, too. Finally, the more efficient determinization yields 
very small (possibly non-performant) controllers, which are particularly useful 
for debugging the model. 

Table 2: Controller sizes of different representations of controllers from the quantitative 
verification benchmark set [24], i.e. from the tools STORM and PRISM. “States” is the 
number of states in the controller, “BDD” the number of nodes of the smallest BDD of 
20 tries and dtControl 2.0 the smallest DT we could construct. The smallest numbers 
in each row are highlighted. 

Case study                          States     BDD      dtControl 2.0 
triangle-tireworld.9                48         51       23 
pacman.5                            232        330      33 
rectangle-tireworld.11              241        498      373 
philosophers-mdp.3                  344        295      181 
firewire_abst.3.rounds              610        61       25 
rabin.3                             704        303      27 
ij.10                               1,013      436      753 
zeroconf.1000.4.true.correct_max    1,068      386      63 
blocksworld.5                       1,124      3,985    855 
cdrive.10                           1,921      5,134    2,401 
consensus.2.disagree                2,064      138      67 
beb.3-4.LineSeized                  4,173      913      59 
csma.2-4.some_before                7,472      1,059    103 
eajs.2.100.5.ExpUtil                12,627     1,315    153 
elevators.a-11-9                    14,742     6,750    9,883 
exploding-blocksworld.5             76,741     34,447   1,777 
echoring.MaxOffline1                104,892    43,165   1,543 
wlan_dl.0.80.deadline               189,641    5,738    2,563 
pnueli-zuck.5                       303,427    50,128   150,341 

We see at least two major promising future directions. Firstly, synthesis 
of predicates could be made more automatic using mathematical reasoning on 
the domain knowledge, such as substituting expressions with a certain unit of 
measurement into other domain equations in the places with the same unit of 
measurement, e.g. to plug difference of two velocities into an equation for velocity. 
Secondly, one could transform the controllers into possibly entirely different 
controllers (not just less permissive) so that they still preserve optimality (or 
yield $\varepsilon$-optimality) but are smaller or simpler. Here, a closer interaction loop 
with the model checkers might lead to efficient heuristics. 
Abstract. We present HLola, an extensible Stream Runtime Verification (SRV) 
tool, that borrows from the functional language Haskell (1) rich types for data 
in events and verdicts; and (2) functional features for parametrization, libraries, 
high-order specification transformations, etc. 

SRV is a formal dynamic analysis technique that generalizes Runtime Verifica- 
tion (RV) algorithms from temporal logics like LTL to stream monitoring, al- 
lowing the computation of verdicts richer than Booleans (quantitative values and 
beyond). The keystone of SRV is the clean separation between temporal depen- 
dencies and data computations. However, in spite of this theoretical separation 
previous engines include hardwired implementations of just a few datatypes, re- 
quiring complex changes in the tool chain to incorporate new data types. Addi- 
tionally, when previous tools implement features like parametrization these are 
implemented in an ad-hoc way. In contrast, HLola is implemented as a Haskell 
embedded DSL, borrowing datatypes and functional aspects from Haskell, re- 
sulting in an extensible engine*. We illustrate HLola through several examples, 
including a UAV monitoring infrastructure with predictive characteristics that has 
been validated in online runtime verification in real mission planning. 


1 Introduction 


Runtime Verification [4, 14, 18] is a dynamic technique that studies (1) how to generate 
monitors from formal specifications, and (2) algorithms to monitor the system under 
analysis, one trace at a time. Early RV specification languages were based on logics 
like past LTL [19] adapted to finite traces [5, 10, 15], regular expressions [23], fix-point 
logics [1], rule based languages [3], or rewriting [21]. Verdicts, and often observations too, 
in most of these specification logics are restricted to Booleans, largely because most 
early logics in RV were borrowed from static verification—where decidability is cru- 
cial. SRV [9,22] attempts to generalize these monitoring algorithms to richer datatypes, 
including in observations and verdicts. SRV offers declarative specifications where off- 
set expressions allow accessing streams at different moments in time, including future 
instants. Most previous SRV developments [9, 11] and their extensions to event-based 
systems [8, 11, 12, 17] focus on efficiently implementing the temporal engine, promising 
that new datatypes can be incorporated easily. However, in practice, adding a datatype 
requires modifying the parser, the internal representation and the runtime system. Consequently, 
existing tools only support a limited hardwired collection of datatypes (typically 
Booleans and numeric types for quantitative monitoring). 

* This work was funded in part by the Madrid Regional Government under project 
“S2018/TCS-4339 (BLOQUES-CM)” and by the Spanish National Project “BOSCO (PGC2018-102210-B-100)”. 
4 The tool is available open-source at http://github.com/imdea-software/hlola 

In this paper we demonstrate the tool HLola, whose core language is Lola [9], 
but that enables arbitrary datatypes. HLola is implemented as an embedded DSL in 
Haskell. Other RV tools implemented as eDSLs include [2, 13] (in Scala), and [24] 
which implements LTL as an eDSL in Haskell. The main theoretical novelty of HLola 
is a technique called lift deep embedding, that consists in borrowing types transparently 
from Haskell and embedding the resulting language back into Haskell (see [7] for an in- 
troduction to HLola with details of the theoretical underpinnings). In fact, most HLola 
datatypes were introduced after the temporal engine was completed without requiring 
any re-implementation. An eDSL enables higher-order functions to describe transfor- 
mations that produce stream declarations from stream declarations, enabling stream 
parametrization for free. HLola libraries collect these transformers so new logics like 
LTL, MTL, etc. with Boolean and quantitative semantics can be implemented in a few 
lines (see Section 2). Haskell type-classes enable simplifiers, which can anticipate the 
value of an expression without requiring the computation of all its sub-expressions. 
Implementing these in previous systems requires re-inventing and implementing features 
manually (such as macro expansion). HLola even allows specifications as data to implement 
“specifications within specifications” (a feature that allows computing a full 
auxiliary specification at every instant, useful in simulation and for nested properties). 
This is used in a UAV scenario to implement Kalman filters [16] as monitors that 
predict the trajectory of the unmanned aircraft. The output of this monitor is used to 
anticipate problems (using another monitor) and take preventive planning actions. 


Stream Runtime Verification in a nutshell. SRV generalizes monitoring algorithms to 
arbitrary data, where datatypes are abstracted using multi-sorted first-order interpreted 
signatures (called data theories in the Lola terminology). The signatures are interpreted 
in the sense that every functional symbol f used to build terms of a given type is accompanied 
by an evaluation function f (the interpretation) that allows the computation of 
values (given values of the arguments). A Lola specification $\langle I, O, E \rangle$ consists of (1) a 
set of typed input stream variables $I$, which correspond to the inputs observed by the 
monitor; (2) a set of typed output stream variables $O$, which represent the outputs of 
the monitor as well as intermediate observations; and (3) defining equations, which associate 
every output $y \in O$ with a stream expression $E_y$ that describes declaratively 
the intended values of $y$. The set of stream expressions of a given type is built from 
constants and function symbols as constructors (as usual), and also from offset expressions 
of the form s[k,d] where s is a stream variable, k is an integer number and d is 
a value of the type of s used as default. For example, altitude[-1, 0.0m] represents 
the value of stream altitude in the previous step of time, with 0.0m as default 
value to be used at the initial instant. Online efficient algorithms can be synthesized for 
specifications with (bounded) future accesses [9, 22], where efficiency means that resources 
(time and space) are independent of the length of the trace and can be calculated 
statically. HLola specifications can be efficiently monitored in a trace-length independent sense [7]. 
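
The meaning of an offset expression s[k,d] can be sketched in Python over a finite, fully recorded trace. This is only an illustration of the semantics described above, not how the HLola engine evaluates streams; the function name and the sample values are hypothetical.

```python
def offset(stream, now, k, default):
    """Value of `stream` at instant now + k, or `default` if that instant
    lies before the start or after the end of the trace."""
    i = now + k
    return stream[i] if 0 <= i < len(stream) else default


altitude = [10.0, 12.5, 11.0]

# altitude[-1, 0.0] evaluated at every instant of the trace.
previous = [offset(altitude, t, -1, 0.0) for t in range(len(altitude))]

# A defining equation built from an offset: is the altitude increasing?
climbing = [altitude[t] > offset(altitude, t, -1, 0.0)
            for t in range(len(altitude))]

print(previous)
print(climbing)
```

At the initial instant the access falls off the trace, so the default value 0.0 is used.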
2 The HLola Tool 


Fig. 1 shows the software architecture of HLola. We start from an HLola specification, 
which can borrow datatypes, notation and features from the Haskell language (repre- 
sented by the red dashed arrow in Fig. 1). A simple translator processes the specification 
and generates code in the Haskell eDSL. The translator does not fully parse the spec 
and only performs simple rewrites, leaving most of the specification unchanged. The 
resulting code is combined with the HLola engine (developed in Haskell) and compiled 
into a binary in the target platform. A well-known downside of this approach is that 
during the second compilation stage, error reports may be rather cryptic. On the other 
hand, a Haskell expert can write specifications directly in the embedded DSL, which 
still resembles Lola, to finely tune an HLola specification. 

The enhanced capabilities of HLola with respect to Lola (streams as data, stream 
type polymorphism and parametric streams) impact the syntax of the language, which 
diverges slightly from the syntax of the original Lola. HLola files can either be libraries 
or specifications. Libraries include HLola code that defines streams and facilities to create 
streams, and must be declared using library <Name> (where <Name> is the name 
of the library) on the first line of the HLola file. Specifications first state the format for 
input and output events as format JSON or format CSV. Source files can then import 
libraries and stream data manipulation facilities (called theories) with the statements 
use library <Name> and use theory <Name> respectively. HLola files can also 
import arbitrary Haskell libraries using the statement use haskell <Name>, and include 
Haskell code directly anywhere within the blocks delimited between #HASKELL 
and #ENDOFHASKELL. Specifications then define the input and output streams. An input 
stream is declared by its type and name in a line of the form input <Type> <name>, 
just like in the original Lola language. The syntax of <Type> follows the Haskell notation. 
An output stream is specified by its type, name and parameters on the left hand 
side of =, and its defining expression on the right hand side of =: 


output <TypeConstraints>? <Type> <name> <args>* = <Expr>, 


where <TypeConstraints> is an optional set of constraints over the polymorphic 
types handled by the stream (expressed in Haskell notation), and <args> is an optional 
list of arguments of the form <Type> <name>. We can use define instead of output 
to define intermediate streams, whose values are not reported by the monitor but can 
be used by other streams. The defining <Expr> of an output stream allows the use of 
let clauses, where blocks, type annotations, do notation, etc. The access to the value 
of a stream s at the current instant uses the term s[now] to distinguish it from s, the 
stream itself (whose type is stream of values). The offset expression that accesses a 
stream s at a shift of i with default value d is written as s[i|d], as in classic Lola. The 
symbol ' is used to lift an object o from the theory as in 'o. We sometimes indicate 
the arity of the object o being lifted for clarity or to aid the type inference as in 2'o. To 
improve readability, some operators have been overridden by their lifted version, such 
as if-then-else. 

Fig. 1. Software Architecture of HLola. 


Libraries. The following HLola file defines a library of Past-LTL operators, called LTL, 
as part of the HLola distribution (see footnote 5). 


library LTL
use library Utils
output Bool historically <Stream Bool p> = p[now] && historically p [-1|'True]
output Bool once <Stream Bool p> = p[now] || once p [-1|'False]
output Bool since <Stream Bool p> <Stream Bool q> = q[now] ||
  (p[now] && p `since` q [-1|'False])
output Int nFalses <Stream Bool p> = nFalses p [-1|0] + if p[now] then 0 else 1
output Double percFalses <Stream Bool p> = nFalses p [now] `intdiv` (instantN[now])

The auxiliary library Utils includes instantN, which stores the current instant number. 
Stream historically is parametrized by Boolean stream p. Once instantiated, 
historically p will be true until p becomes false for the first time, and will be 
false thereafter. This definition uses offsets to define the unrolling, using the constant 
value true in the first instant, lifted from Haskell as 'True. This library also contains 
quantitative operators like nFalses, which counts the total number of falsifications up to 
an instant, and percFalses, which calculates the ratio of falsifications. A similar library 
for MTL includes the parametrized definition of $\varphi \, \mathcal{U}_{[a,b]} \, \psi$: 


output Bool until <(Int,Int) (a,b)> <Stream Bool phi> <Stream Bool psi> = from a
  where from a | a == b    = psi[a|'False]
               | otherwise = psi[a|'False] || (phi[a|'True] && from (a+1))


Here the parametrized stream until takes the interval (a,b) and the streams $\varphi$ and $\psi$ 
as parameters. Similarly, the library for Quantitative MTL introduces a parametrized 
stream to calculate the arithmetic mean of the last k values of a given stream: 


output Double meanLast <Int k> <Stream Double str> = numr / denom
  where denom = 1'fromIntegral (2'min 'k (instantN[now])); numr = sumLast k str [now]

which takes as parameters the window size k and the stream str. The denominator is 
the minimum of k and instantN, converted to Double. The numerator is the sum of 
the last k values in str. Polymorphism allows us to generalize this definition to any 
Haskell type as long as it is Fractional, Equalizable and Streamable, using the following 
stream signature instead (and the same expression): 


output (Eq a, Fractional a, Streamable a) => a meanLast <Int k> <Stream a str> 
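
To make the recursive reading of the Past-LTL streams above concrete, the following Python sketch evaluates them over a finite recorded trace. It mirrors the defining equations (current value combined with the previous output, using the stated default at the first instant); the function names are hypothetical and the code is unrelated to the HLola engine.

```python
def historically(p):
    """out[t] = p[t] and out[t-1], with True as default before the trace."""
    out, prev = [], True
    for v in p:
        prev = v and prev
        out.append(prev)
    return out


def once(p):
    """out[t] = p[t] or out[t-1], with False as default before the trace."""
    out, prev = [], False
    for v in p:
        prev = v or prev
        out.append(prev)
    return out


def since(p, q):
    """out[t] = q[t] or (p[t] and out[t-1]), with False as default."""
    out, prev = [], False
    for pv, qv in zip(p, q):
        prev = qv or (pv and prev)
        out.append(prev)
    return out


def n_falses(p):
    """Running count of instants at which p was false."""
    out, count = [], 0
    for v in p:
        count += 0 if v else 1
        out.append(count)
    return out


print(historically([True, True, False, True]))
print(since([True, True, True, False], [False, True, False, False]))
```

Each function keeps only the previous output value, which reflects why such past-only specifications need memory independent of the trace length.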


5 All libraries, definitions and examples are available open-source in the GitHub repository and 
at https://software.imdea.org/hlola/specs.html. 
3 Example Specifications 


In this section we show a collection of HLola specifications to demonstrate the capabil- 
ities of HLola to define stream based monitors. 


Temporal Logics. HLola allows us to easily define, in a declarative way, many specifi- 
cations written in temporal logic. The HLola distribution contains many LTL examples, 
including a sender/receiver model from [6], and other temporal logics. Consider the following 
MTL property from [20]: $\Box(\mathit{alarm} \rightarrow (\Diamond_{[0,10]}\,\mathit{allClear} \lor \Diamond_{[10,10]}\,\mathit{shutdown}))$, 
which includes deadlines between environment events and the corresponding system 
responses, stating that an alarm is followed by a shutdown event in exactly 10 time 
units unless allClear is received. This is defined in HLola as follows: 


format JSON
use library MTL
#HASKELL
data Event = Alarm | AllClear | ShutDown deriving (Generic, Read, FromJSON, Eq)
#ENDOFHASKELL
input Event event
define Bool allClear = event[now] === 'AllClear
define Bool shutdown = event[now] === 'ShutDown
define Bool alarm = event[now] === 'Alarm
output Bool property = alarm[now] `implies` (willClear[now] || willShutdown[now])
  where willClear = eventually (0,10) allClear
        willShutdown = eventually (10,10) shutdown
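
The intended verdicts of this specification can be reproduced on a finite recorded trace with a small Python sketch. It assumes one event per instant and treats instants beyond the end of the trace as not satisfying the operand, analogous to the 'False defaults in the until definition above; the function names are hypothetical and this is not the HLola MTL library.

```python
def eventually(a, b, p, t):
    """True if p holds at some instant of [t+a, t+b]; instants past the
    end of the trace count as False (an assumption of this sketch)."""
    last = min(t + b, len(p) - 1)
    return any(p[i] for i in range(t + a, last + 1))


def property_at(events, t):
    """alarm implies (allClear within [0,10] or shutdown at exactly +10)."""
    all_clear = [e == "AllClear" for e in events]
    shutdown = [e == "ShutDown" for e in events]
    if events[t] != "Alarm":
        return True
    return eventually(0, 10, all_clear, t) or eventually(10, 10, shutdown, t)


# Ten alarms followed by a shutdown: only the first alarm meets its deadline.
trace = ["Alarm"] * 10 + ["ShutDown"]
verdicts = [property_at(trace, t) for t in range(len(trace))]
print(verdicts)
```

The first alarm is answered by the shutdown exactly 10 instants later; the later alarms have no allClear in their window and no shutdown at their exact deadline within the trace.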


Pinescript example. TradingView is an online charting platform for stock exchanges, 
which offers the Pinescript language to query stock time series. Pinescript queries are 
then run on the company's servers. We have implemented the indicators of Pinescript 
in HLola as a library, and we have implemented a trading strategy using the HLola 
Pinescript library. Compared to Pinescript, HLola offers formal semantics and runtime resource 
guarantees (time and space), and is much more expressive, for example allowing 
relational queries that involve multiple stocks (their averages, etc.). 


UAV specifications. We have also used HLola for the online monitoring of several 
properties of UAV missions, for example: (1) that the UAV does not fly over forbidden 
regions, and (2) that the UAV is in a good position when it takes a picture. The 
input streams of these two specifications consist of the state of the UAV at every instant 
and the onboard camera events to detect when a picture is being captured. This 
specification imports geometric facilities from theory Geometry2D, and Haskell libraries 
Data.Maybe and Data.List. It then defines custom datatypes to retrieve data 
from the UAV, which are enclosed in a verbatim HASKELL block. The output stream 
all_ok_capturing checks that, whenever the vehicle is taking a picture, the height, 
roll and pitch are acceptable and the vehicle is near the target location. The output 
stream flying_in_safe_zones reports whether the UAV is flying outside every forbidden 
region. The output stream depth_into_poly takes the minimum of the distances 
between the vehicle position and every side of the forbidden region inside which the 
vehicle is. 


6 Available at www.tradingview.com/script/DushajXt-MACD-Strategy 
format JSON
use theory Geometry2D
use library Utils
use haskell Data.Maybe
use haskell Data.List
#HASKELL
data Attitude = Attitude {yaw :: Double, roll :: Double, pitch :: Double}
  deriving (Show, Generic, Read, FromJSON, ToJSON)
data Target = Target {x :: Double, y :: Double, num_wp :: Double} ...
data Position = Position {x :: Double, y :: Double, alt :: Double} ...
#ENDOFHASKELL
input Attitude attitude
input Vector2 velocity
input Position position
input Double altitude
input Target target
input [[[Double]]] nofly
input [String] events_within
output Bool all_ok_capturing = capturing [now] `implies`
  (height_ok [now] && near [now] && roll_ok [now] && pitch_ok [now])
output Bool flying_in_safe_zones = 'isNothing (flying_in_poly [now])
output (Maybe Double) depth_into_poly = let
    mSides = '(fmap polygonSides) (flying_in_poly [now])
    distance_from_pos = 'shortestDist (filtered_pos [now])
  in 2'fmap distance_from_pos mSides
  where shortestDist x = minimum.map (distancePointSegment x)
define Bool capturing = ...
define Double filtered_pos_component <(Position->Double) field> <String nm> = ...
define Double filtered_pos_x = filtered_pos_component x "x" [now]
define Double filtered_pos_y = filtered_pos_component y "y" [now]
define Double filtered_pos_alt = filtered_pos_component alt "alt" [now]
define Point2 filtered_pos = 'P (filtered_pos_x [now]) (filtered_pos_y [now])
define Bool near = let target_pos = 'targetToPoint (target [now])
  in 2'distance (filtered_pos [now]) target_pos < 1
  where targetToPoint (Target x y _) = P x y
define Bool height_ok = filtered_pos_alt [now] > 0
define Bool roll_ok = '(abs.roll) (attitude [now]) < 0.0523
define Bool pitch_ok = '(abs.pitch) (attitude [now]) < 0.0523
define [Polygon] no_fly_polys = ...
define (Maybe Polygon) flying_in_poly = let
    position_in_poly = 'pointInPoly (filtered_pos [now])
  in 2'find position_in_poly (no_fly_polys [now])


Intermediate stream capturing captures whether the UAV is taking a picture (omitted 
for brevity). The streams filtered_pos_alt and filtered_pos represent the altitude 
and location of the UAV, filtered to reduce noise from the sensors. We omit the 
definition of the filter, which is implemented in filtered_pos_component. The streams 
height_ok, roll_ok, and pitch_ok check that the corresponding attitude of the 
vehicle is within certain boundaries. Finally, the intermediate stream no_fly_polys 
obtains a set of Polygons from the input forbidden regions (its definition has been 
omitted), and the stream flying_in_poly returns the forbidden region in which the 
vehicle is flying, if any. The artifact attached to this paper includes more UAV 
specifications, which have been validated in real missions [25]. 
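
As an illustration of what flying_in_poly and depth_into_poly compute, the following is a minimal Python sketch, not the HLola implementation. The function names mirror the identifiers of the specification, but their bodies are assumptions: polygons are taken to be lists of (x, y) vertices and containment is decided with the even-odd ray casting rule, which may differ from the Geometry2D theory.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # assumed representation: vertices in order


def polygon_sides(poly: Polygon) -> List[Tuple[Point, Point]]:
    """Consecutive vertex pairs, closing the polygon at the end."""
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]


def distance_point_segment(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def point_in_poly(p: Point, poly: Polygon) -> bool:
    """Even-odd ray casting test (assumed containment rule)."""
    x, y = p
    inside = False
    for (x1, y1), (x2, y2) in polygon_sides(poly):
        if (y1 > y) != (y2 > y):  # side crosses the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def flying_in_poly(p: Point, no_fly_polys: List[Polygon]) -> Optional[Polygon]:
    """First forbidden region containing p, or None."""
    return next((poly for poly in no_fly_polys if point_in_poly(p, poly)), None)


def depth_into_poly(p: Point, no_fly_polys: List[Polygon]) -> Optional[float]:
    """Minimum distance from p to the sides of the region it is inside."""
    poly = flying_in_poly(p, no_fly_polys)
    if poly is None:
        return None
    return min(distance_point_segment(p, a, b) for a, b in polygon_sides(poly))
```

The Optional results play the role of the Maybe values in the specification: a position outside every forbidden region yields None for both functions.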
Abstract. AMULET 2.0 is a fully automatic tool for the verification of 
integer multipliers using computer algebra. Our tool models multiplier 
circuits given as and-inverter graphs as a set of polynomials and applies 
preprocessing techniques based on elimination theory of Gröbner bases. 
Finally it uses a polynomial reduction algorithm to verify the correctness 
of the given circuit. AMULET 2.0 is a re-factorization and improved re- 
implementation of our previous multiplier verification tool AMULET 1.0. 


1 Introduction 


Formal verification of arithmetic circuits is important to prevent issues like the 
famous Pentium FDIV bug [28]. Up to now there have been many attempts to 
verify these circuits, but even today the problem of fully automatic verification 
of arithmetic circuits, and especially multipliers, is still considered to be hard. 

Methods based on decision diagrams [6] rely on manual structural decomposi- 
tion of the multiplier. Approaches based on satisfiability checking (SAT) are not 
scalable [3]. Recently progress has been made using theorem provers [29]. However, 
the multipliers have to be given as SVL netlists, which relies on preservation 
of hierarchical information. For flattened gate-level multipliers the currently most 
successful technique uses algebraic reasoning [7,15,17,25,26]. In this line of 
work the circuit is modeled as a set of polynomials and the specification is then 
checked to be implied by the circuit polynomials. For non-experts Chap. 2 of [15] 
might serve as introduction to bit-level verification using computer algebra. 

In our approach [17] we apply a combination of SAT solving and computer 
algebra. Certain parts of the multiplier, i.e., complex final stage adders that 
are generate-and-propagate (GP) adders [27], are hard to verify using computer 
algebra, but are easy to verify using SAT solvers [21]. Therefore we apply adder 
substitution [17] and replace complex final stage adders by simple ripple-carry 
(RC) adders. The equivalence of the adders is verified using SAT solvers. The 
correctness of the simplified multiplier is shown using computer algebra [17]. 

This tool paper presents AMULET 2.0, a successor of AMULET 1.0 [17,19]. 
AMULET 2.0 reads multipliers given as and-inverter graphs (AIG) [22] and 
fully automatically applies adder substitution and verifies the (simplified) circuit. 
Furthermore, certificates can be generated in the Nullstellensatz proof format [16] 
or in the practical algebraic calculus (PAC) [20] to validate the verification results. 
AMULET 2.0 is a modular C++ re-implementation of AMULET 1.0 (while 
AMULET 1.0 consists of a single C file). AMULET 2.0 is not only a standalone 
tool but also serves as a polynomial reasoning framework, i.e., parts can 
easily be integrated into different workflows, cf. Sect. 4. AMULET 2.0 still 
provides the same functionality as AMULET 1.0, but with improved algorithms, cf. 
Sect. 5, based on the same theory [15,17]. In this paper we focus on novelties of 
AMULET 2.0 and refer the reader to [19] for an introduction to AMULET 1.0. 


2 Circuit Verification using Computer Algebra 


AMULET 2.0 takes as input signed or unsigned integer multipliers $C$, given 
as AIGs, with $2n$ input bits $a_0, \ldots, a_{n-1}, b_0, \ldots, b_{n-1} \in \{0,1\}$ and output bits 
$s_0, \ldots, s_{2n-1} \in \{0,1\}$. We denote the internal AIG nodes by $l_1, \ldots, l_k \in \{0,1\}$. 
Let $\mathbb{Z}[X] = \mathbb{Z}[a_0, \ldots, a_{n-1}, b_0, \ldots, b_{n-1}, l_1, \ldots, l_k, s_0, \ldots, s_{2n-1}]$. The multiplier 
$C$ is correct iff for all possible inputs $a_i, b_i \in \{0,1\}$ the specification $\mathcal{L} = 0$ holds: 

$$\mathcal{L} = -\sum_{i=0}^{2n-1} 2^i s_i + \left(\sum_{i=0}^{n-1} 2^i a_i\right)\left(\sum_{i=0}^{n-1} 2^i b_i\right) \tag{1}$$

For signed multipliers the most significant bits $s_{2n-1}$, $a_{n-1}$, and $b_{n-1}$ determine 
the sign and the weights have to be negated, i.e., $2^{2n-1}$ becomes $-2^{2n-1}$. 
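
To make the specification concrete, the following minimal Python sketch evaluates the specification polynomial of Eq. (1) for given bit vectors. The function names and the LSB-first list encoding are assumptions made for illustration; they are not part of AMULET 2.0.

```python
from typing import Sequence


def weighted_sum(bits: Sequence[int], signed: bool) -> int:
    """Integer value of an LSB-first bit vector.

    For signed vectors the weight of the most significant bit is negated,
    which corresponds to two's complement encoding.
    """
    total = 0
    for i, bit in enumerate(bits):
        weight = 1 << i
        if signed and i == len(bits) - 1:
            weight = -weight
        total += weight * bit
    return total


def spec_value(a: Sequence[int], b: Sequence[int], s: Sequence[int],
               signed: bool = False) -> int:
    """Value of the specification: -sum(2^i s_i) + (sum 2^i a_i)(sum 2^i b_i)."""
    if len(a) != len(b) or len(s) != 2 * len(a):
        raise ValueError("expected n + n input bits and 2n output bits")
    return (-weighted_sum(s, signed)
            + weighted_sum(a, signed) * weighted_sum(b, signed))
```

For example, with n = 2 the unsigned inputs a = 3 and b = 2 together with the output bits of 6 evaluate to zero, while any other output vector yields a non-zero value.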

The semantics of each AIG node implies a polynomial relation, e.g., $u = v \wedge \neg w$ 
implies $-u + v - vw = 0$. Let $G(C) \subseteq \mathbb{Z}[X]$ be the set of polynomials that 
contains for each AIG node the corresponding polynomial relation. Additionally, 
all variables $x \in X$ are Boolean and we enforce this property by the set of 
Boolean value constraints $B(X) = \{x(1-x) \mid x \in X\} \subseteq \mathbb{Z}[X]$. The polynomials 
in $G(C) \cup B(X)$ are ordered according to a lexicographic order, such that the 
output variable of a gate is always greater than the inputs of the gate [23]. 
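
The gate relation above can be checked exhaustively. The following Python sketch (helper names are illustrative assumptions) enumerates all Boolean assignments and confirms that the polynomial vanishes exactly on those assignments that are consistent with the gate.

```python
from itertools import product


def and_not_relation(u: int, v: int, w: int) -> int:
    """Polynomial -u + v - v*w encoding the gate u = v AND (NOT w)."""
    return -u + v - v * w


def boolean_constraint(x: int) -> int:
    """Polynomial x(1 - x), which is zero exactly for x in {0, 1}."""
    return x * (1 - x)


# All Boolean assignments on which the gate polynomial evaluates to zero.
consistent = [
    (u, v, w)
    for u, v, w in product((0, 1), repeat=3)
    if and_not_relation(u, v, w) == 0
]
```

The list consistent contains one entry per input pair (v, w), namely the one in which u carries the correct gate output.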

Let $J(C) = \langle G(C) \cup B(X) \rangle \subseteq \mathbb{Z}[X]$ be the ideal generated by $G(C) \cup B(X)$. 
The circuit fulfills its specification if and only if we can derive that $\mathcal{L} \in J(C)$ [17]. 
We showed in [17] that $G(C) \cup B(X)$ is a D-Gröbner basis [2] for $J(C) \subseteq \mathbb{Z}[X]$. 
Thus, the correctness of the circuit can be established by reducing $\mathcal{L}$ by the 
polynomials $G(C) \cup B(X)$ and checking whether the result is zero. 
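
The reduction can be sketched for a hand-written 2-bit unsigned multiplier. The Python code below is an illustrative assumption, not the algorithm of AMULET 2.0: polynomials are dictionaries from variable sets to integer coefficients, the Boolean constraints are applied implicitly by taking set unions, and reduction by a gate polynomial is carried out as substitution of the gate output by its definition, in reverse topological order. The gate names (p00, c1, ...) and the use of XOR gates instead of AIG nodes are hypothetical simplifications.

```python
from typing import Dict, FrozenSet, List, Tuple

Poly = Dict[FrozenSet[str], int]  # multilinear polynomial: term -> coefficient


def var(name: str) -> Poly:
    return {frozenset([name]): 1}


def scale(p: Poly, c: int) -> Poly:
    return {t: c * k for t, k in p.items()} if c else {}


def add(p: Poly, q: Poly) -> Poly:
    out = dict(p)
    for term, coeff in q.items():
        out[term] = out.get(term, 0) + coeff
        if out[term] == 0:
            del out[term]
    return out


def mul(p: Poly, q: Poly) -> Poly:
    out: Poly = {}
    for t1, c1 in p.items():
        for t2, c2 in q.items():
            term = t1 | t2  # set union realises x * x = x
            out[term] = out.get(term, 0) + c1 * c2
            if out[term] == 0:
                del out[term]
    return out


def substitute(p: Poly, name: str, replacement: Poly) -> Poly:
    """Replace every occurrence of variable `name` in p by `replacement`."""
    out: Poly = {}
    for term, coeff in p.items():
        if name in term:
            out = add(out, mul({term - {name}: coeff}, replacement))
        else:
            out = add(out, {term: coeff})
    return out


def AND(x: Poly, y: Poly) -> Poly:
    return mul(x, y)


def XOR(x: Poly, y: Poly) -> Poly:
    return add(add(x, y), scale(mul(x, y), -2))  # x + y - 2xy


def specification(n: int) -> Poly:
    a: Poly = {}
    b: Poly = {}
    s: Poly = {}
    for i in range(n):
        a = add(a, scale(var(f"a{i}"), 1 << i))
        b = add(b, scale(var(f"b{i}"), 1 << i))
    for i in range(2 * n):
        s = add(s, scale(var(f"s{i}"), 1 << i))
    return add(scale(s, -1), mul(a, b))


def multiplier_2bit(last_gate=AND) -> List[Tuple[str, Poly]]:
    """Gate definitions in topological order (inputs first)."""
    return [
        ("p00", AND(var("a0"), var("b0"))), ("p01", AND(var("a0"), var("b1"))),
        ("p10", AND(var("a1"), var("b0"))), ("p11", AND(var("a1"), var("b1"))),
        ("s0", var("p00")),
        ("s1", XOR(var("p01"), var("p10"))), ("c1", AND(var("p01"), var("p10"))),
        ("s2", XOR(var("p11"), var("c1"))),
        ("s3", last_gate(var("p11"), var("c1"))),
    ]


def reduce_spec(spec: Poly, gates: List[Tuple[str, Poly]]) -> Poly:
    remainder = spec
    for name, definition in reversed(gates):  # outputs first, inputs last
        remainder = substitute(remainder, name, definition)
    return remainder
```

Calling reduce_spec(specification(2), multiplier_2bit()) yields the empty dictionary, i.e., the zero polynomial. Replacing the last gate by an XOR, multiplier_2bit(XOR), leaves a non-zero remainder over the input variables.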

However, simply reducing the specification by $G(C) \cup B(X)$ leads to large 
intermediate results [24]. Hence, we eliminate variables in $G(C) \cup B(X)$ prior 
to reduction to yield a more compact D-Gröbner basis [17], which boils down to 
simple substitutions, but relies on the elimination theorem of Gröbner bases [9]. 


3 Usage 


AMULET 2.0 is available at http://fmv.jku.at/amulet2 and is published as open 
source under the MIT license. AMULET 2.0 relies on the AIGER library [5] and 
the GMP library [10]. The AIGER library is provided together with the source 
code of AMULET 2.0; the GMP library needs to be pre-installed by the user. 
AMULET 2.0 is compiled by executing “./configure.sh && make”. 
In a complete workflow one should first apply adder substitution, using the 
substitution mode of AMULET 2.0, to make sure that a potentially complex final 
stage adder is replaced by a simple RC adder. Afterwards, one of the two 
modes, the verification mode or the certification mode, can be applied to verify the 
(simplified) multiplier, which we call the rewritten multiplier in the following. If 
it is known that the final stage adder is not a complex GP adder, the substitution 
step can be omitted. We present a complete demonstration for the unsigned 64-bit 
multiplier <bpwtcl.aig>, which is included in the supplementary material [14]. 
The output of AMULET 2.0 can be seen in the corresponding log-files that are 
also included in the artifact. 


Adder Substitution. First we apply adder substitution by running 


./amulet -substitute bpwtcl.aig miter.cnf rewritten.aig [options] 


If the multiplier computes multiplication of signed integers the option “-signed” 
has to be invoked, because the signedness is part of the circuit specification. 

If adder substitution can be applied successfully, the generated miter is written 
to <miter.cnf> and the rewritten multiplier to <rewritten.aig>. Otherwise, 
the input multiplier will be written to <rewritten.aig> and a trivially unsat- 
isfiable CNF is written to <miter.cnf>. The file <miter.cnf> has to be given 
to a SAT solver, e.g., KISSAT [4], which is then expected to return unsatisfiable. 
The rewritten multiplier can be verified or certified using AMULET 2.0. 


Verification. Verification is executed by 
./amulet -verify rewritten.aig [options] 


As for adder substitution, one has to invoke the option “-signed” for signed mul- 
tipliers. Furthermore, the option “-no-counter-examples” is available, which 
turns off generation and saving of counter examples in <rewritten.cex>, in the 
case when the multiplier in <rewritten.aig> is incorrect. 


Certification. Certification is applied using 
./amulet -certify rewritten.aig out.pol out.prf out.spc [options] 


In this mode, AMULET 2.0 verifies the multiplier and automatically gener- 
ates proof certificates, which can be checked by corresponding proof checkers. 
AMULET 2.0 supports two proof formats, Nullstellensatz proofs [1,16] and PAC 
proofs [20] based on the polynomial calculus [8]. The default proof format is 
the Nullstellensatz proof, because it generates smaller proof files and is faster to 
check. Proofs in the PAC format can be generated using the option “-pac”. All 
options of the verification mode are available too. 

The proofs are stored in the provided files <out.pol>, <out.prf>, and 
<out.spc>. The file <out.pol> contains the gate constraints, the second file 
<out.prf> the core proof in the selected proof format and the third file <out.spc> 
the specification of the multiplier. The generated proofs can be given to the 
proof checker NUSS-CHECKER [16] for Nullstellensatz proofs or to the proof 
checkers PACHECK [20] or PASTEQUE [20] for PAC proofs. 
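
The three steps of the workflow can be collected in a small driver script. The following Python sketch only assembles the command lines quoted in this section; the function name, the default file names, and the assumption that the SAT solver binary is called kissat and takes the CNF file as its only argument are illustrative choices, not part of AMULET 2.0.

```python
from typing import List


def amulet_workflow(aig: str, signed: bool = False, certify: bool = False,
                    pac: bool = False) -> List[List[str]]:
    """Argument vectors for substitution, SAT check, and verification.

    The options -signed and -pac are taken from the usage description;
    everything else (file names, solver binary) is an assumption.
    """
    opts = ["-signed"] if signed else []
    commands = [
        ["./amulet", "-substitute", aig, "miter.cnf", "rewritten.aig", *opts],
        ["kissat", "miter.cnf"],  # assumed solver invocation
    ]
    if certify:
        proof_opts = opts + (["-pac"] if pac else [])
        commands.append(["./amulet", "-certify", "rewritten.aig",
                         "out.pol", "out.prf", "out.spc", *proof_opts])
    else:
        commands.append(["./amulet", "-verify", "rewritten.aig", *opts])
    return commands
```

Each returned list can be passed to subprocess.run. The script deliberately does not interpret the tool output; the SAT solver is expected to report unsatisfiable before the result of the last command is trusted.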
Fig. 1. Architecture of AMULET 2.0. 


4 AMulet 2.0 


In this section we present the architecture of AMULET 2.0 and discuss novel 
optimizations. The design of AMULET 2.0 is shown in Fig. 1. In contrast to 
AMULET 1.0, which consists of one single C file, AMULET 2.0 is split into 
components, which also allows integrating only parts, e.g., the polynomial library 
or the polynomial solver, in different workflows, cf. the provided demos in the 
artifact [14]. AMULET 2.0 is implemented in C++11 and consists of around 
6000 lines of code. It relies on the AIGER library [5] to process the given AIG 
and the GMP library [10] to represent large integers. 

The mode of AMULET 2.0 is triggered by the command line input, cf. Sect. 3. 
In substitution mode, AMULET 2.0 parses the AIG, allocates the internal 
gate structure, and invokes the substitution engine for adder substitution. In 
verification mode, AMULET 2.0 reads the AIG and initializes the gate structure. 
Afterwards, the circuit is verified in the polynomial solver using polynomial 
operations of the polynomial library. In certification mode proofs are generated in 
addition. In the following we present the individual components of AMULET 2.0. 


Parser Module. AMULET 2.0 checks whether the given AIG circuit fulfills 
the requirements described in Sect. 2, i.e., the AIG circuit has an even number 
of inputs and an equal number of outputs. The AIG module wraps functions of 
the external AIGER library that are needed to process the input file. 
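
The input check can be illustrated on the AIGER header line, which has the form "aag M I L O A" (ASCII) or "aig M I L O A" (binary). The following Python sketch is an assumption about how such a check might look and does not use the AIGER library; in particular, requiring zero latches is an additional assumption that is not stated above.

```python
from typing import Dict, Union

Header = Dict[str, Union[str, int]]


def parse_aiger_header(line: str) -> Header:
    """Split an AIGER header into its five mandatory counters."""
    fields = line.split()
    if len(fields) < 6 or fields[0] not in ("aag", "aig"):
        raise ValueError("not an AIGER header")
    max_var, inputs, latches, outputs, ands = (int(x) for x in fields[1:6])
    return {"format": fields[0], "max_var": max_var, "inputs": inputs,
            "latches": latches, "outputs": outputs, "ands": ands}


def looks_like_multiplier(header: Header) -> bool:
    """Even number of inputs (2n) and an equal number of outputs.

    The latch check is an assumption: multipliers are combinational.
    """
    inputs = int(header["inputs"])
    return (inputs > 0 and inputs % 2 == 0
            and header["outputs"] == inputs
            and header["latches"] == 0)
```

A header such as "aag 10 4 0 4 6" passes the check (n = 2), whereas a circuit with three inputs is rejected.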


Gate Library. After parsing we allocate a gate for each AIG node, which 
includes structural information, such as dependencies, or whether the gate 
represents an input/output or an XOR-gate. Furthermore, each gate is linked to 
a unique variable. If the given AIG is verified or certified, AMULET 2.0 also 
initializes the gate constraints and creates the specification polynomial $\mathcal{L} \in \mathbb{Z}[X]$. 


Substitution Engine. In substitution mode, AMULET 2.0 applies heuristic 
pattern matching to identify GP adders [17]. In AMULET 2.0 we enhanced 
the identification heuristics and cover special cases that are not considered in 
AMULET 1.0. Thus, AMULET 2.0 is able to detect more GP adders than 
AMULET 1.0. After a positive GP pattern match, AMULET 2.0 generates an 
equivalent RC adder and replaces the GP adder by the RC adder. A bit-level 
miter is generated in CNF to verify the equivalence of the adders. The rewritten 
multiplier and the CNF miter are printed to the provided output files. 
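
The property that the miter encodes, namely that no input makes the two adders differ, can be illustrated by exhaustive enumeration for small bit-widths. The following Python sketch is only an illustration: AMULET 2.0 emits a CNF miter for a SAT solver instead of enumerating inputs, and the flat carry expressions below are just one textbook form of a generate-and-propagate adder.

```python
from itertools import product
from typing import List, Sequence, Tuple


def ripple_carry_add(a: Sequence[int], b: Sequence[int],
                     cin: int = 0) -> Tuple[List[int], int]:
    """LSB-first ripple-carry adder; returns (sum bits, carry out)."""
    out, carry = [], cin
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out, carry


def generate_propagate_add(a: Sequence[int], b: Sequence[int],
                           cin: int = 0) -> Tuple[List[int], int]:
    """Adder whose carries are flat expressions over g_i and p_i."""
    g = [x & y for x, y in zip(a, b)]  # generate signals
    p = [x ^ y for x, y in zip(a, b)]  # propagate signals
    carries = [cin]
    for i in range(len(g)):
        # c_{i+1} = g_i | p_i g_{i-1} | ... | p_i ... p_0 cin
        carry, prefix = g[i], p[i]
        for j in range(i - 1, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & cin
        carries.append(carry)
    return [p[i] ^ carries[i] for i in range(len(g))], carries[-1]


def miter_is_unsat(width: int) -> bool:
    """True iff the two adders agree on every input of the given width."""
    for bits in product((0, 1), repeat=2 * width + 1):
        a, b, cin = bits[:width], bits[width:2 * width], bits[-1]
        if ripple_carry_add(a, b, cin) != generate_propagate_add(a, b, cin):
            return False
    return True
```

Enumeration is feasible only for small widths, which is why the equivalence check is delegated to a SAT solver in practice.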
Polynomial Solver. The polynomial solver is based on the solving engine 
of AMULET 1.0 [19] and is used to verify or certify the given multiplier. In a 
nutshell, the polynomial solver first applies preprocessing by eliminating selected 
variables. Afterwards, the remaining variables are ordered into column-wise 
slices, such that we can apply our incremental verification algorithm [18], where 
we split the specification $\mathcal{L}$ into multiple polynomials and verify the multiplier by 
deriving the correctness of each slice using polynomial reduction. The necessary 
polynomial operations are implemented in the Polynomial Library. 

In AMULET 2.0 we eliminate variables before ordering them, while in 
AMULET 1.0 it is the other way around. We eliminate all internal gates of the 
XOR-structures and all single-parent nodes in the AIG. Thus, fewer variables 
are considered for ordering, which improves computation time of AMULET 2.0. 

Furthermore, we include a novel XOR-based slicing approach in AMULET 2.0, 
which relies on the fact that many multiplier architectures use XOR-skeletons to 
compute the output bits. We identify these skeletons and assign all nodes of a 
skeleton to the same slice. Gates occurring between XOR-skeletons are assigned 
to the smaller (less significant) slice. Hence, after two iterations all slices are 
fixed, which improves slicing compared to AMULET 1.0. All variables that are 
not assigned to slices, e.g., gates used to compute the partial products in Booth 
encoding [27], are eliminated from the gate structure. 

In the few cases where we cannot identify XOR-skeletons, e.g., in multipliers 
containing a carry-select adder, we fall back on the slicing approach of AMULET 1.0: 
We slice based on input cones and eagerly move gates between slices to reduce 
the number of carries, by iterating multiple times over the variables. 

After assigning gates to slices, AMULET 2.0 reduces the slice-wise specifica- 
tions incrementally by the sliced gate constraints and checks whether the final 
result is zero, following the implementation of AMULET 1.0. If the final remain- 
der is not zero, AMULET 2.0 detects counter examples, i.e., input assignments 
for which the multiplier circuit computes an incorrect result. 
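
One way to obtain a counter example from a non-zero remainder over the input variables is sketched below in Python. It is an assumption for illustration and not necessarily the procedure implemented in AMULET 2.0: the remainder is multilinear, so setting exactly the variables of a monomial of minimal degree to 1 and all other inputs to 0 makes every other monomial vanish, and the remainder evaluates to the non-zero coefficient of the chosen monomial. The remainder in the usage note is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Optional

Poly = Dict[FrozenSet[str], int]  # term -> non-zero coefficient


def evaluate(poly: Poly, assignment: Dict[str, int]) -> int:
    """Evaluate a multilinear polynomial under a 0/1 assignment."""
    return sum(coeff for term, coeff in poly.items()
               if all(assignment[v] for v in term))


def counter_example(remainder: Poly,
                    inputs: Iterable[str]) -> Optional[Dict[str, int]]:
    """Input assignment on which a non-zero remainder does not vanish."""
    if not remainder:
        return None  # zero remainder: the circuit is correct
    # A monomial of minimal degree has no other monomial as a subset.
    smallest = min(remainder, key=lambda term: (len(term), sorted(term)))
    return {x: int(x in smallest) for x in inputs}
```

For the hypothetical remainder -8·a1·b1 + 16·a0·a1·b0·b1, the function returns a1 = b1 = 1 and a0 = b0 = 0, on which the remainder evaluates to -8.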

In certification mode, AMULET 2.0 tracks polynomial operations in the 
selected proof format, i.e., Nullstellensatz or PAC, and prints gate constraints, 
the generated proof, and the specification $\mathcal{L}$ to the provided files. 


Polynomial Library. The polynomial library implements the arithmetic 
operations for addition and multiplication of polynomials (by constants), and division 
by terms. Since all variables represent Boolean values, we always reduce 
exponents greater than one automatically to one, i.e., we assume $x^2 = x$. 

Polynomials are represented as linked lists of monomials. Each monomial 
consists of a coefficient, represented using the GMP library, and a term. Terms 
are linked lists of variables, which are internally shared using a hash table. 
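
The sharing of terms through a hash table can be sketched as follows. This Python fragment is an illustrative assumption and does not reflect the C++ data structures of AMULET 2.0: terms are stored as sorted tuples instead of linked lists, a dictionary plays the role of the hash table, and Python integers stand in for GMP coefficients.

```python
from typing import Dict, Tuple


class Term:
    """Product of Boolean variables, shared through a lookup table."""

    _table: Dict[Tuple[str, ...], "Term"] = {}
    __slots__ = ("variables",)

    def __init__(self, variables: Tuple[str, ...]) -> None:
        self.variables = variables

    @classmethod
    def of(cls, *names: str) -> "Term":
        """Return the unique shared object for this set of variables."""
        key = tuple(sorted(set(names)))  # duplicates vanish: x * x = x
        term = cls._table.get(key)
        if term is None:
            term = cls(key)
            cls._table[key] = term
        return term

    def __mul__(self, other: "Term") -> "Term":
        return Term.of(*self.variables, *other.variables)


class Monomial:
    """Coefficient (arbitrary-precision integer) times a shared term."""

    __slots__ = ("coefficient", "term")

    def __init__(self, coefficient: int, term: Term) -> None:
        self.coefficient = coefficient
        self.term = term
```

Because equal terms are represented by the same object, comparing two terms reduces to an identity check, and multiplying a term by itself returns the very same object.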

In AMULET 1.0 we do not share monomials and make hard copies in the 
few occasions when a monomial needs to be copied. This has the benefit that 
we can simply modify coefficients of the monomials, e.g., during addition. In 
our experiments we observed that allocating new GMP objects is actually quite 
time consuming, and therefore we now share monomials in AMULET 2.0, using 
reference counting, which decreases verification time by a factor of two. 
Fig. 2. Verification of AOKI multipliers (left) and of large multipliers (right), in seconds. 

5 Evaluation 


In our experiments we use an Intel Xeon E5-2620 v4 CPU at 2.10 GHz (with turbo- 
mode disabled) with a memory limit of 128 GB. The time is listed in seconds 
(wall-clock time). We compare AMULET 2.0 to our previous tool AMULET 1.0 
and to the most recent related work RevSCA, RevSCA-2.0 [25] and ABC-based 
work of [7] on multiplier verification using computer algebra, where circuits are 
given as AIGs. The tool of [26] is not yet available. We consider two versions 
of AMULET 1.0: (i) AMULET 1.0 as published in [17], and (ii) AMULET 1.5, a 
slightly improved version [13] with new heuristics for detecting GP adders. The 
experimental data is included in the artifact [14]. 

In our first experiment we consider the comprehensive AOKI benchmark 
set [12], which provides 384 signed and unsigned integer multiplier architectures 
up to input bit-width 64, also covering Booth encoding. We consider all 384 
architectures of bit-width 64. The time limit is set to 300 seconds. The results 
are shown on the left side of Fig. 2, where it can be seen that AMULET 2.0 
is the only tool that is able to verify the complete benchmark set. RevSCA 
only supports verification of unsigned integers. ABC-based work of [7] uses an 
optimization, which only works for simple multiplier architectures. Enabling this 
optimization on the more involved AOKI benchmarks leads to incompleteness. 
Without enabling it, [7] either produces a segmentation fault or exceeds the time 
limit. Thus there are no results for [7] on the left side of Fig. 2. 

In our second experiment we generate benchmarks of simple multipliers up 
to input size 2048, using scripts by Arist Kojevnikov [11]. The time limit is set 
to 86 400 seconds (24 h) and the results are shown on the right side of Fig. 2. It 
can be seen that AMULET 2.0 outperforms all competitor tools and is an order 
of magnitude faster on large multiplier circuits. 


6 Conclusion 


We presented AMULET 2.0, a fully automatic tool for verifying multiplier circuits 
given as AIGs. AMULET 2.0 is a re-factorization and re-implementation of our 
previous verification tool AMULET 1.0 [17,19] and successfully verifies a large 
set of multiplier architectures. In the future we want to directly integrate a SAT 
solver into AMULET 2.0 and provide language bindings, e.g. for Python. 
Abstract. This paper is about shipping runtime verification to the 
masses. It presents the crucial technology enabling everyday car owners 
to monitor the behaviour of their cars in-the-wild. Concretely, we present 
an Android app that deploys RTLOLA runtime monitors for the purpose of 
diagnosing automotive exhaust emissions. For this, it harvests the 
availability of cheap Bluetooth adapters to the On-Board-Diagnostics (OBD) 
ports, which are ubiquitous in cars nowadays. We detail its use in the 
text of Real Driving Emissions (RDE) tests and report on sample runs 
that helped identify violations of the regulatory framework currently 
valid in the European Union. 


1 Introduction 


In the last decade, far more than 600 million cars have entered the streets world- 
wide [10]. With very few exceptions, each of these cars is equipped with a stan- 
dardized On-Board-Diagnostics (OBD [16]) interface. Five years ago it surfaced 
that many of the cars out there do not adhere to the regulatory framework with 
which they are supposed to comply. For example, a number of undeniable proofs 
of tampered emission cleaning systems in passenger cars [5,3,14] are known by 
now. When this scandal first surfaced, the regulations imposed by the authori- 
ties were related to isolated tests carried out under lab-like conditions on chassis 
dynamometers [20,4]. Since then, there has been a growing understanding that 
emission and fuel or battery consumption measurements should best take place 
in a realistic context. Hence, the first test framework for testing on public roads, 
the Real Driving Emissions (RDE) test has been developed [19,17] and is being 
rolled out for car model approval in Europe and other entities of jurisdiction. 
The RDE regulation specifies the conditions under which a car trip qualifies 
as a valid RDE test. These conditions refer to the trajectory driven, duration, 
altitudes, speeds, and on the dynamics of the driving profile [17]. By combining 
the information available at the OBD port and the position of the car, it is 
possible to cast RDE testing into a runtime monitoring [21,13,12] problem. In- 
deed we have shown in earlier work [9] how to formalize the RDE regulations in 
RTLOLA [7,8], a real-time extension of the stream-based specification language 
Lola [6]. Lola combines the ease-of-use of rule-based specification languages with 
the expressive power of heavy-weight scripting languages or temporal logics. The 
eponymous framework generates runtime monitors for such specifications, which 
were successfully deployed, for instance, on unmanned aircraft [18,2]. 

An official RDE test requires a calibrated portable emissions measurement 
system (PEMS) to be connected to the car’s exhaust pipe while driving the test, so 
as to correctly quantify the amount of exhaust emissions induced. The purchasing 
costs of a PEMS are in the order of €250,000 which is close to unaffordable even 
in a research context. However, many car models expose a variety of diagnosis 
data through OBD and an OBD-to-Bluetooth adapter can be purchased for around 
€10. The data exposed depends on the type of engine, emission cleaning system, 
and other components in use. There are several minimal combinations of OBD 
data giving good approximations of emitted gases. In particular, various car 
models expose the sensor readings of their after-treatment NOx sensor deployed 
at the rear of the exhaust pipe. 


Contribution. This paper presents LOLADRIVES, an Android app enabling car 
owners to carry out real driving emission tests with little investment. 
Prerequisites are (i) an Android phone, (ii) an OBD-to-Bluetooth adapter, and 
(iii) a car model that does indeed expose the needed values via OBD. If the latter 
is not the case, the app can still serve the user as a convenient personal 
monitoring and logging device for the many quantities exposed while driving. 

A structural overview of LOLADRIVES is depicted in Figure 1. At the core of the 
app is an Android version of the RTLOLA engine [7]. The engine is strictly separated 
from the data acquisition and the RTLOLA RDE specification. This separation will 
make it possible to reuse the approach in other runtime monitoring contexts, be it 
of espresso machines via USB, or drones via Wi-Fi. In both cases, it would 
especially be the specification in RTLOLA that needs to change, not the engine. Car 
sensor data is acquired via Bluetooth from the OBD device, and combined with 
location data provided by Android’s GPS service. The data streams are recorded for 
later diagnosis. Anticipating future application scenarios involving crowd sourcing 
car data, we advertise the app as part of a car data platform (CDP), which includes 
an upload facility for donating drive records. While driving, the app’s user 
interface (UI) displays diagnostic information to the user, both regarding the 
correct execution of an RDE test drive and the car’s emission data. We will detail 
the separate components of the app next. 

Fig. 1: LOLADRIVES 

Notably, the lack of any calibration and the unknown precision of the data 
exposed by the car manufacturer via OBD make it impossible to consider the 
RDE test results reported by LOLADRIVES as anything more than indicators of 
the car’s RDE behaviour in a legal sense. 


2 RDE Monitoring on Android 


The primary feature of LOLADRIVES is to monitor the progress of an RDE test 
drive. For this, it uses the RTLOLA monitoring framework. This bridges the gap 
between formally sound concepts and every-day use cases. While RTLOLA does 
target a broad audience, that audience is still intended to be expert users rather 
than the general public. It requires users to execute three tasks: provide a formal 
specification of the intended behaviour, supply input data, and interpret the 
monitor’s output. LOLADRIVES reduces these tasks to minimal action points for 
end-users. 


Specification. No end-user input is required with respect to the RTLOLA specifica- 
tion. The definition of what is a valid RDE test is fixed [9] and strictly follows the 
constraints imposed by the regulation issued by the European Commission [17]. 
These constraints concern the driving behaviour and layout of the route. Some 
of them apply universally, e.g., the ambient temperature must range between 
273 K and 303 K. For others, the RDE regulation differentiates three 
environments: urban, rural, and motorway, with different environments imposing 
different restrictions on the car, such as an average velocity between 15 and 40 km/h 
in an urban environment. A segment refers to all parts of the test drive in which 
the car operates in a certain environment. While segments may be interrupted, 
each one needs to occupy a specific share of the total distance travelled. 
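
The notion of a segment can be illustrated with a small Python sketch that groups possibly interrupted parts of a drive by environment and derives the distance share and average velocity of each segment. The data layout and function names are assumptions; only the two universal bounds quoted above (273 K to 303 K, and 15 to 40 km/h for the urban segment) are taken from the text, and the required distance shares are deliberately left out because they are not stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Part = Tuple[str, float, float]  # (environment, distance in km, duration in h)


def segment_stats(parts: Iterable[Part]) -> Dict[str, Dict[str, float]]:
    """Aggregate interrupted parts of a drive into one entry per segment."""
    distance: Dict[str, float] = defaultdict(float)
    duration: Dict[str, float] = defaultdict(float)
    for environment, km, hours in parts:
        distance[environment] += km
        duration[environment] += hours
    total = sum(distance.values())
    return {
        env: {
            "distance_km": distance[env],
            "share": distance[env] / total if total else 0.0,
            "avg_kmh": distance[env] / duration[env] if duration[env] else 0.0,
        }
        for env in distance
    }


def ambient_temperature_ok(kelvin: float) -> bool:
    return 273.0 <= kelvin <= 303.0


def urban_average_velocity_ok(stats: Dict[str, Dict[str, float]]) -> bool:
    return "urban" in stats and 15.0 <= stats["urban"]["avg_kmh"] <= 40.0
```

In this sketch two separate urban parts of 10 km each, interrupted by a rural part, contribute to one urban segment of 20 km.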


Input Data Provision. LOLADRIVES uses sensor readings provided over the OBD 
interface as input data. The user only has to plug the OBD-to-Bluetooth adapter 
in the respective port at (or close to) the dashboard of her car and pair it with 
her phone. The car then automatically transmits data to the phone while driving. 


Interpretation of Output. While driving, LOLADRIVES assists the user in the 
critical task of satisfying all the constraints that make up a valid RDE. It provides 
feedback on the driving behaviour indicating which requirements on the test are 
satisfied to what extent, and which still need attention. Furthermore, it evaluates 
the measured exhaust data and informs the user of whether or not the car violates 
emission regulations. Both of these tasks require an online analysis of driving 
data. For this analysis, LOLADRIVES uses the RTLOLA monitoring framework. 


Foundational Underpinning. RTLOLA [8,7] is a stream-based runtime verifica- 
tion framework. The RTLOLA monitor analyses sequences of input data to assess 
whether or not the system complies with the specification. The specification lan- 
guage has a formal semantics which enables devising provably correct monitoring 
algorithms [15]. 

An RTLOLA specification consists of input stream declarations where each 
input stream corresponds to a source of input data such as the NOx sensor 
of the car. Output stream declarations then spell out how to filter and refine 
the input data. For this, RTLOLA provides primitives for complex analyses such 
as sliding window aggregation for common aggregation functions. Further, the 
specification contains binary trigger conditions. The satisfaction of such a con- 
dition constitutes a violation of the specification and prompts the monitor to 
immediately relay a warning to the user. The following snippet is an extract of 
an RTLOLA specification for RDE test drives [11]: 


input velo_kmph, accel_mpss: Float64 
output is_rural := ...    output rural_avg_velo := ... 
output rural_dyn : Float64 @1Hz filter: is_rural := velo_kmph * accel_mpss / 3.6 
output rural_pctl_dyn : Float64 @1Hz := 
    rural_dyn.aggregate(over: 7200, using: pctl(95)).defaults(to: 0.0) 
trigger rural_pctl_dyn > (0.136 * rural_avg_velo + 14.44) 
        ∧ rural_avg_velo <= 74.6 


This specification fragment checks whether the car complies with the RDE 
regulations regarding the driving dynamics in the rural segment†. The first line 
declares two input streams representing the velocity in km h⁻¹ and acceleration 
in m s⁻² supplied by the car. The third line computes the dynamics in m² s⁻³, by 
multiplying the velocity and acceleration. The regulations then demand that the 
95th percentile of the dynamics is no greater than $0.136 \cdot v_{avg} + 14.44$, where $v_{avg}$ 
is the average velocity of the vehicle. The computation of the velocity and the 
dynamics only considers sensor readings obtained while in the rural segment. The 
full specifications are publicly available [1]. Note that while the specification is 
relatively easy to design and understand for computer scientists and engineers, 
it exceeds the expertise that can be expected of lay users. However, it is not 
necessary for them to be confronted with the full potential of the language because 
LOLADRIVES comes preconfigured with a set of RDE-specific specifications. 
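
The check expressed by the trigger can be mirrored offline in a few lines of Python. The sketch below is a batch approximation under stated assumptions and not the RTLOLA semantics: it processes a complete list of 1 Hz samples instead of a sliding window over 7200 s, it takes the average rural velocity to be the plain mean of the rural samples (the definition of rural_avg_velo is elided in the snippet), and it uses the nearest-rank percentile, which may differ from the percentile computed by RTLOLA.

```python
import math
from typing import Iterable, List, Tuple

Sample = Tuple[float, float, bool]  # (velo_kmph, accel_mpss, is_rural)


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def rural_dynamics_violated(samples: Iterable[Sample]) -> bool:
    """True iff the trigger condition of the snippet holds for the samples."""
    rural = [(v, a) for v, a, is_rural in samples if is_rural]
    if not rural:
        return False  # no rural data: the percentile defaults to 0.0
    dynamics = [v * a / 3.6 for v, a in rural]  # km/h * m/s^2 / 3.6 = m^2/s^3
    avg_velo = sum(v for v, _ in rural) / len(rural)
    pctl_dyn = percentile(dynamics, 95)
    return pctl_dyn > 0.136 * avg_velo + 14.44 and avg_velo <= 74.6
```

Samples recorded outside the rural segment are ignored, corresponding to the filter on is_rural in the specification.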

As can be seen, the requirements on the end-user are minimal. Thus, the setup 
enables users to conduct RDE test drives and assess the emission-behaviour of 
their cars without requiring them to understand the underlying technology. 


3 User Experience 


This section discusses the user perspective on LOLADRIVES. After a general 
overview, we report on the use of LOLADRIVES for conducting RDE test drives 
with a rented vehicle (the precise car model being unknown upfront). 


Overview. The preparation of the test requires the user to plug the OBD-adapter 
into the OBD-port of the car. After starting car and app, LOLADRIVES receives 
data packets and determines the sensor profile of the car, assuming phone and 
adapter are paired via Bluetooth. Some sensor profiles provide insufficient data 
to conduct an RDE test drive. In this case, the app is still convenient to use for 
displaying and logging the available data in real time regardless of RDE regulations, 
see Figure 2a. If the data suffices, the app selects an appropriate specification and 
initializes the RTLOLA monitor. LOLADRIVES then starts filtering and visualising 
the data output and trigger notifications provided by the monitor. 

Fig. 2: UI of LOLADRIVES displaying different views and a map of a test route. 

† See Annex IIIA, Appendix 7a, 3.1.3 in the EU regulations [17]. 


After successful setup, the UI switches to an RDE guiding view (Figure 2b). 
From top to bottom, it shows the total time, which must be between 90 and 
120 min to finish the test, and the total distance travelled. The next line indicates 
the current state of the conditions for a valid RDE test drive disregarding emission 
data. In the screenshot, the drive is still in progress and inconclusive, indicated by 
the question mark. Instead, the UI can also indicate success or failure. The latter 
verdict can occur far before the time limit is reached, caused by an irrecoverable 
situation such as transgression of the 160 km h⁻¹ speed limit. Note that the
indicator reports the status that would result if the test drive were to end at this moment.
Together with the regulatory constraints, this implies that the current verdict 
can alternate between success and failure from minute 90 to 120. As there is no 
specific point in time when the test ends, the app continues to compute statistics 
until the tester manually stops it or the 120 min mark is reached. Beneath the 
status indicator is the green NOₓ bar displaying the total NOₓ emissions. The
vertical red bar denotes the permitted threshold of 168 mg km⁻¹.
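
The status logic described above can be sketched as follows. This is a simplified illustration, not the RTLOLA specification shipped with the app: it only uses the limits named in the text (90 to 120 min, 160 km/h, 168 mg/km), and the function names and the single `other_conditions_ok` flag standing in for all remaining RDE conditions are assumptions made for this example.

```python
# Limits quoted in the text; the real RDE rules contain many more conditions.
MIN_DURATION_MIN = 90
SPEED_LIMIT_KMH = 160
NOX_LIMIT_MG_PER_KM = 168


def validity_status(elapsed_min, max_speed_kmh, other_conditions_ok):
    """Verdict if the drive ended now, disregarding emission data.

    `other_conditions_ok` is a hypothetical placeholder for every RDE
    condition not modelled here (segment shares, driving dynamics, ...).
    """
    if max_speed_kmh > SPEED_LIMIT_KMH:
        return "failure"       # irrecoverable, possible long before minute 90
    if elapsed_min < MIN_DURATION_MIN:
        return "inconclusive"  # shown as a question mark in the UI
    return "success" if other_conditions_ok else "failure"


def nox_within_limit(total_nox_mg, distance_km):
    """Compare the distance-specific NOx emission against the RDE threshold."""
    return total_nox_mg / distance_km <= NOX_LIMIT_MG_PER_KM
```

Because `other_conditions_ok` may change from minute to minute, the verdict of this sketch can alternate between success and failure after minute 90, as described for the app.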

The next three UI groups represent the progress in each of the distinct seg- 
ments: urban, rural, and motorway. Each group consists of two horizontal bars. 
The gray progress bar displays the distance covered in the respective segment. 
The vertical blue indicators denote lower and upper bounds as per official
regulation, for an expected trip length of 83 km. The blue bar below the gray one
            Drive 1                        Drive 2
            Distance  NOₓ      CO₂         Distance  NOₓ      CO₂
            [km]      [mg/km]  [g/km]      [km]      [mg/km]  [g/km]
Urban       35.45     137      222         37.46     102      251
Rural       22.33     305      154         27.40      90      172
Motorway    26.10     241      153         25.37     105      175
Total       83.88     214      183         90.22      99      205


Table 1: Aggregation of the emission data based on the CDP. 


illustrates two different metrics for the driving dynamics. Both dots need to re- 
main below/above their thresholds. A more aggressive acceleration behaviour 
shifts the dots to the right and a passive driving style to the left. 


Test Drive. The technical framework and visual feedback of the app were tested 
in two RDE test drives. Both tests were conducted with an Audi A6 Avant 45-TDI 
hybrid diesel, which is approved as Euro 6d-TEMP (DG) with an NOₓ threshold
of 80 mg km⁻¹ under lab conditions and 168 mg km⁻¹ for RDE conditions. Among
the diagnosis parameters available within this car are vehicle and engine speed,
ambient temperature, engine fuel rate, mass air flow, and two NOₓ sensors,
one in front of and one behind the emission cleaning system in the exhaust pipe.
With this data, exhaust mass flow and fuel consumption can be computed, from
which the total amounts of NOₓ and CO₂ can be derived [11]. In both drives, the
driving dynamics were close to the allowed maximum, in the first test below and
in the second test above the threshold, so the second test drive did not result
in a valid RDE test. In both cases, the app correctly reported the satisfaction
or violation of the RDE criteria. In the first drive, the app reported an average
NOₓ emission of 214 mg km⁻¹. This exceeds the 168 mg km⁻¹ RDE threshold and thus constitutes a violation of the regulation.
The app also allows for inspection of the driving data in a plotted form (Fig- 
ure 2c). Figure 2d shows the route of an RDE test drive. The first half of the 
time constituted the urban segment (green). The next 30-40% of the test mainly 
consisted of the rural segment (purple) followed by the motorway segment (red). 


Data Harvesting. For further analysis, data can be uploaded to a cloud storage 
which is part of the car data platform (CDP). This platform provides a uniform 
way to harvest data by specifying a format for collection, analysis, and exchange 
of this data. CDP builds upon a JSON format (https://json-schema.org/)
containing timestamped events such as an OBD response, including its raw payload.
As an example, the data presented in Table 1 is an aggregation of the RDE test 
drives mentioned above obtained by post-processing the data. 
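
As a plausibility check on Table 1, the "Total" NOₓ values can be recomputed from the per-segment rows. The sketch below assumes that the total is the distance-weighted mean of the segment values; the text does not state the aggregation rule, so this is an assumption, and the helper name is made up for this example. Since the table entries are rounded, recomputed totals can deviate from the printed ones by about one unit (as happens for CO₂).

```python
def distance_weighted_mean(segments):
    """segments: list of (distance_km, emission_per_km) pairs."""
    total_distance = sum(distance for distance, _ in segments)
    weighted_sum = sum(distance * value for distance, value in segments)
    return weighted_sum / total_distance


# NOx rows (urban, rural, motorway) of Table 1, in km and mg/km.
drive1_nox = [(35.45, 137), (22.33, 305), (26.10, 241)]
drive2_nox = [(37.46, 102), (27.40, 90), (25.37, 105)]

print(round(distance_weighted_mean(drive1_nox)))  # 214, above the 168 mg/km limit
print(round(distance_weighted_mean(drive2_nox)))  # 99, below the limit
```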


4 Conclusion 


LOLADRIVES pushes runtime verification technology into cars and phones of 
everyday users. The app is available on Google Play [1]; development of an iOS
version has already started. Moreover, the car data platform constitutes a crowd-sourcing
initiative for car data with the intention to enable large scale analyses of emission 
data beyond a single trip and a single car model. 
Abstract. Statistical model checking uses Monte Carlo simulation to
analyse stochastic formal models. It avoids state space explosion, but 
requires rare event simulation techniques to efficiently estimate very low 
probabilities. One such technique is RESTART. Villén-Altamirano recently 
showed—by way of a theoretical study and ad-hoc implementation—that 
a generalisation of RESTART to prolonged retrials offers improved per- 
formance. In this paper, we demonstrate our independent replication of 
the original experimental results. We implemented RESTART with pro- 
longed retrials in the FIG and modes tools, and apply them to the models 
used originally. To do so, we had to resolve ambiguities in the original 
work, and refine our setup multiple times. We ultimately confirm the pre- 
vious results, but our experience also highlights the need for precise doc- 
umentation of experiments to enable replicability in computer science. 


1 Introduction 


In stochastic timed systems, the time between faults, customer interarrival times, 
transmission delays, or exponential backoff wait times follow (continuous) prob- 
ability distributions. Probabilistic model checking [3] can compute dependabil- 
ity metrics like reliability and availability in the Markovian case. To evade state 
space explosion and evaluate non-Markovian systems, statistical model check- 
ing (SMC [2]) has become a popular alternative. At its core, SMC is Monte 
Carlo simulation for formal models. It faces a runtime explosion when estimating
the probability p of a rare event with a sufficiently low error, e.g. an error of
±10⁻¹⁰ for p ≈ 10⁻⁹ (i.e. a relative error of 0.1). Rare event simulation (RES)
techniques [17] address this problem. They can broadly be categorised into im- 
portance sampling and importance splitting. The former changes the probabil- 
ity distributions while the latter changes the simulation algorithm to make the 
rare event more likely. Both techniques then compensate for these changes in 
the statistical evaluation. RES has garnered the interest of mathematicians and 
computer scientists alike. The scientific outcomes range from theoretical studies 
of a RES technique’s limit behaviour and optimality [8,14,16] over experimental 
validation on Matlab studies or ad-hoc implementations [10,11,19] to application 
reports using larger case studies [5,12,18] as well as automated tools [4,6,15,18]
that accept a loss of optimality in exchange for practicality. 
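
The runtime explosion can be made concrete with a back-of-the-envelope calculation. The sketch below assumes plain Monte Carlo with independent Bernoulli samples and a normal-approximation confidence interval; the function name is made up for this illustration. Requiring the half-width z·sqrt(p(1−p)/n) to be at most ε·p gives n ≥ z²(1−p)/(p·ε²).

```python
from math import ceil
from statistics import NormalDist


def required_samples(p, rel_error, confidence=0.95):
    """Samples needed so that the normal-approximation confidence interval
    for a Bernoulli probability p has half-width at most rel_error * p."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95 %
    return ceil(z * z * (1 - p) / (p * rel_error ** 2))


# A relative error of 0.1 for a probability around 1e-9 needs
# several hundred billion simulation runs:
print(f"{required_samples(1e-9, 0.1):.2e}")
```

For comparison, the same precision for p = 0.5 needs only a few hundred samples, which is why rare events call for dedicated techniques.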

Two recent papers showed theoretically [21] and empirically [19] that pro- 
longing retrials in the RESTART importance splitting technique [22] reduces the 
required number of samples for the same error, with optimal runtime around 
prolonging by 1 to 2 levels. The models and parameters used in [19] are de- 
scribed in supplementary material [20], but the implementation is not publicly 
available. In this paper, we demonstrate our replication of the results of [19,21], 
where replication “means that an independent group can obtain the same res- 
ult using artifacts which they develop completely independently” in the ACM 
terminology [1]. To this end, we implemented RESTART with prolonged retrials 
(RESTART-P) in the FIG rare event simulator [4] and the modes statistical model 
checker [7] of the MODEST TOOLSET [13]. We recreated the models in the IOSA
and MODEST languages, and ran experiments following the original setup. 

Our experiments confirm the behaviour and performance improvements of 
RESTART-P reported in [19,21]. However, we encountered ambiguities in the tex- 
tual and pictorial descriptions of RESTART-P and the experimental setup in the 
original papers, some of which we could only resolve with input from the author 
of [19,21]. Different parts of our work thus reside on different levels between rep- 
lication and reproduction (which “means that an independent group can obtain 
the same result using the author’s own artifacts” [1]). Throughout the paper, we 
document where we achieved fully independent replication, where information 
from private communication was needed, and where we had to ultimately resort 
to requesting and inspecting the source code for the original implementation. 

The contribution of this paper is thus threefold: (1) We provide pseudocode 
for RESTART-P in Sect. 2 that clarifies the technical details w.r.t. [19,21]. (2) We 
demonstrate the new RESTART-P capabilities of FIG and modes by replicating 
the original experiments in Sect. 3. (3) We reflect on our experience (as prac- 
tical computer scientists) in independently replicating existing (theoretically- 
flavoured) work. 


2 RESTART with Prolonged Retrials 


Let a stochastic timed discrete-event model be given as a tuple (S, s_0, step, F)
of a set of states S, an initial state s_0 ∈ S, a function step: S → [0,∞) × S
where step(s) samples a random path from s to the next event and returns a
pair (t, s') of its duration and next state, and a subset of rare event states F ⊆ S.
A simulation run is a sequence of states obtained by repeatedly applying step. 
Models with general probability distributions encode their memory in the states. 

Importance splitting uses an importance function f_I: S → [0,∞) indicating
“how close” a state is to the rare event. Partition the range of f_I into k+1 non-
empty intervals to obtain a level function f_L: S → {0,…,k} with f_I(s_1) <
f_I(s_2) ⇒ f_L(s_1) ≤ f_L(s_2). For simplicity, assume f_L(s_0) = 0 and step(s) =
(t, s') ⇒ f_L(s') ≤ f_L(s)+1 (a step moves up by at most one level). Let C_i = { s |
f_L(s) ≥ i }. Then “thresholds T_i of f_I are defined so that each set C_i is associated
Input: model (S, s_0, step, F), f_L, f_S, prolongation depth j, max. sim. time T_max

t_F := 0, list ℒ := {| (s_0, 0, 0, 0) |}   // (state, time, creation level, last-split level)
while ℒ ≠ ∅ do   // run all trials to end
  (s, t, l_create, l_split) := ℒ.get-remove()   // data of current trial
  while t < T_max do
    (t', s') := step(s)   // simulate to next change in state
    t' := min(t', T_max − t), t := t + t'   // advance time, at most to T_max
    if s ∈ F then t_F := t_F + t' / ∏_{i=1}^{l_split} f_S(i)   // accumulate weighted rare time
    (l, l') := (f_L(s), f_L(s')), s := s'
    if l' < l then   // trial went down:
      if l' = 0 = l_create then l_split := 0   // reset main trial at level 0
      else if l' = 0 ∨ l' < l_create − j then break   // end retrial if 0 or j down
      else l_split := min(l_split, l' + j)   // else update last-split level
    else if l' > l_split then   // trial went up far enough:
      l_split := l'   // update last-split level
      foreach i ∈ {1,…,f_S(l') − 1} do ℒ.add((s', t, l', l_split))   // split off retrials
return t_F   // return accumulated weighted time spent in rare states

Algorithm 1: RESTART with prolonged retrials of depth j (RESTART-P_j)


with f_I ≥ T_i” [21]. Function f_S: {1,…,k} → ℕ \ {0} defines splitting factors.
f_I, f_L, and f_S are specified by experts or derived automatically [6]. Importance
splitting with RESTART starts a run (the main trial) from s_0 that, whenever it
moves up from s in current level l − 1 to s' in level l, spawns f_S(l) − 1 new child
runs (retrials of level l) from s'. Retrials end when they move down below their
creation level. The trials' weights in probability estimation are appropriately
reduced to compensate. RESTART with prolonged retrials of depth j, denoted
RESTART-P_j, is defined as follows in [21] (shortened and adapted to our notation):


In RESTART-P_j, each of the retrials of level i finishes when it leaves set
C_{i−j}; that is, it continues until it down-crosses the threshold i − j. If one
of these trials again up-crosses the threshold where it was generated (or
any other between i − j + 1 and i), a new set of retrials is not performed.
If j ≥ i, the retrials are cut when they reach the threshold 0. The main
trial, which continues after leaving set C_{i−j}, potentially leads to new sets
of retrials if it up-crosses threshold T_i after having left set C_{i−j}. If the
main trial reaches the threshold 0, it collects the weight of all the retrials
(which has been cut at that threshold) and thus, new sets of retrials of
level 1 are performed next time the main trial up-crosses threshold T_1.


In addition, [21, Fig. 1] graphically illustrates the behaviour of RESTART-P_j. The
original RESTART [22] is RESTART-P_0. The above textual description clearly
conveys the core idea of RESTART-P, but we found it to omit three technical details:
— The condition for when an up-going retrial spawns new retrials is more com- 
plex than with RESTART. We became aware of this when comparing the tex- 
tual description with the graphical depiction in [21, Fig. 1]. In fact, we need 
to keep track of the last level at which a retrial will split, and decrement that
value when it moves more than j levels down. (Independent replication.) 

— The definitions in [19,21] do not include 0 in the range of values for i in C_i
and T_i. Our definitions would associate T_0 with states s where f_I(s) = 0. Im-
plemented in FIG, this led to increasing underestimation as the prolongation
depth j increased. Only once we interpreted threshold 0 to refer to level 0
(i.e. states s where f_L(s) = 0) did we obtain consistent estimations across
different j. The correctness of this interpretation was confirmed by the author 
of [19,21] in private communication. (Semi-independent replication.) 

— When a trial reaches, or spends time in, a state in F, we must weight this
event's influence on the statistical estimate by a factor of 1/∏_{i=1}^{f_L(s)} f_S(i) in the
original RESTART. With this weight calculation, FIG produced subtle under-
estimations on some of the models from [20] when j > 0. We finally requested
the source code for the original experiments and found that f_L(s) must be
replaced by the level on which the current trial was last split, i.e. the value
must not change when moving down ≤ j levels. (Resembles a reproduction.)

We make these details explicit in Algorithm 1, for the case of estimating the
long-run average time spent in F (i.e. steady-state simulation). FIG evolved as
described above and is thus mostly a replication. modes was extended with pro-
longations later, using a recursive formulation of the algorithm gleaned from the
original code. It thus lacks the complete independence of a replication as per [1].
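
To make the bookkeeping of Algorithm 1 tangible, the following is a minimal Python transcription together with a toy model. It is a sketch for illustration only and is neither the FIG nor the modes implementation. The M/M/1 queue (arrival rate 1, service rate 2), the rare event "at least 3 customers", the constant splitting factor 2, and all function names are assumptions chosen for this example. For that queue the steady-state probability of the rare event is 0.5³ = 0.125, which the estimates should approach for any prolongation depth j.

```python
import random
from math import prod


def restart_p(s0, step, in_rare, f_level, f_split, j, t_max):
    """One RESTART-P_j run: weighted time spent in rare states up to t_max."""
    t_rare = 0.0
    trials = [(s0, 0.0, 0, 0)]  # (state, time, creation level, last-split level)
    while trials:  # run all trials to their end
        s, t, l_create, l_split = trials.pop()
        while t < t_max:
            dt, s_next = step(s)
            dt = min(dt, t_max - t)  # advance time, at most to t_max
            t += dt
            if in_rare(s):  # weight by the splits up to the last-split level
                t_rare += dt / prod(f_split(i) for i in range(1, l_split + 1))
            l, l_next = f_level(s), f_level(s_next)
            s = s_next
            if l_next < l:  # trial went down
                if l_next == 0 and l_create == 0:
                    l_split = 0  # main trial is reset at level 0
                elif l_next == 0 or l_next < l_create - j:
                    break  # retrial ends at level 0 or more than j levels down
                else:
                    l_split = min(l_split, l_next + j)
            elif l_next > l_split:  # trial went up far enough: split
                l_split = l_next
                for _ in range(f_split(l_next) - 1):
                    trials.append((s_next, t, l_next, l_split))
    return t_rare


def mm1_step(rng, lam=1.0, mu=2.0):
    """Toy model (assumption of this sketch): M/M/1 queue, state = queue length."""
    def step(n):
        rate = lam + (mu if n > 0 else 0.0)
        dt = rng.expovariate(rate)
        goes_up = n == 0 or rng.random() < lam / rate
        return dt, (n + 1 if goes_up else n - 1)
    return step


def estimate(j, runs=100, t_max=500.0, limit=3, seed=1):
    """Average fraction of time with at least `limit` customers in the queue."""
    step = mm1_step(random.Random(seed))
    total = sum(
        restart_p(0, step, lambda n: n >= limit, lambda n: min(n, limit),
                  lambda level: 2, j, t_max)
        for _ in range(runs))
    return total / (runs * t_max)
```

Calling `estimate(0)` corresponds to the original RESTART and `estimate(1)` to prolongation by one level. The sketch starts every run in the empty queue and ignores the initial transient phase, so a small bias towards lower values is to be expected.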


3 Experiments 


Table 2 in [21] provides steady-state estimates, numbers of samples, and runtimes
obtained using RESTART-P_j on a Jackson (i.e. Markov) 2-tandem queueing net-
work for j ∈ {0,…,4}. The same data is given in [19] for j ∈ {0,…,2} on a
similar system with three queues and a seven-node network, in Jackson and non-
Jackson (using Erlang and hyperexponential distributions) variants. The original
articles and extra material [20] describe the models, and the experimental setup:
— The set F is characterised. E.g. for the 3-tandem network, it contains the states
where the third queue has ≥ L = 30 (Jackson) or 14 packets (non-Jackson).
— All probability distributions and the f_I, f_L, and f_S functions are characterised.
— T_max time values for the steady-state simulations are specified for all models.
— The statistical evaluation aims for a relative error of 0.1 with 95 % confidence 
(except for the tandem queue, where the error is 0.005); RESTART-P runs are 
performed sequentially until the half-width of the 95% confidence interval is 
below 10% (resp. 0.5%) of the current estimate. (Note that this guarantees 
the requested confidence only asymptotically for decreasing width [9].) 
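
The sequential stopping rule in the last item can be sketched as follows. This is an illustration under assumptions, not the code of any of the tools: it uses a normal-approximation interval with z = 1.96 for 95 % confidence, and the function name and the `min_samples` safeguard are made up for this example. As noted above, such a rule provides the requested confidence only asymptotically.

```python
import random
from math import sqrt
from statistics import mean, stdev


def run_until_precise(sample, rel_halfwidth, z=1.96, min_samples=30):
    """Draw samples until the confidence interval half-width is at most
    rel_halfwidth times the current estimate; return (estimate, n)."""
    values = [sample() for _ in range(min_samples)]
    while True:
        centre = mean(values)
        half_width = z * stdev(values) / sqrt(len(values))
        if half_width <= rel_halfwidth * abs(centre):
            return centre, len(values)
        values.append(sample())  # not precise enough yet: one more run


# Usage with a stand-in sampler; in the experiments each sample would be
# the result of one independent RESTART-P run.
rng = random.Random(7)
estimate, n = run_until_precise(lambda: rng.gauss(10.0, 1.0), 0.1)
```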
In our replication attempt, we had to resolve the following unspecified aspects: 
— The queue capacities C > L are not documented, but influence the estimate: 
for C close to L, the steady-state probability is underestimated. We settled for 
C = 20·L in FIG's IOSA models (replication); the influence of C − L rapidly
diminishes beyond small values. Later, from inspecting the original source
Table 1. Experimental results for the examples considered in [19,21]

                                  original [19,21]       adapted orig. code     modes                  FIG
model               p         j   p̂         n     time   p̂         n     time   p̂         n     time   p̂         n     time
2-tandem (Jackson)  4.86E-15  0   4.85E-15  3909  2906   4.84E-15  2731  1930   4.88E-15  2542   988   4.85E-15  2537  4202
                              1   4.86E-15  3032  2107   4.93E-15  1905  1654   4.87E-15  1859   939   4.82E-15  1969  4000
                              2   4.86E-15  2660  2091   4.80E-15  1831  1959   4.85E-15  1845  1175   4.86E-15  1700  4379
                              3   4.87E-15  2476  2287   4.86E-15  1691  2319   4.83E-15  1626  1322   4.84E-15  1656  5448
                              4   4.85E-15  2458  3188   4.88E-15  1562  2638   4.85E-15  1610  1626   4.86E-15  1580  6402
3-tandem (Jackson)  4.86E-15  0   4.66E-15   120    54   4.90E-15    89    28   4.24E-15   116     9   4.58E-15   122    43
                              1   4.61E-15    88    35   4.84E-15    44    20   4.90E-15    97    10   5.63E-15    80    36
                              2   4.66E-15    78    38   4.84E-15    49    19   4.83E-15    79    11   5.23E-15    65    39
3-tandem (non-J.)   -         0   7.08E-15    95   137   8.38E-15   728   180   8.87E-15  1002   256   8.28E-15  1293   715
                              1   7.03E-15    65    90   8.50E-15   661   181   8.10E-15   650   182   8.65E-15   618   436
                              2   7.03E-15    55    90   8.34E-15   388   191   8.53E-15   386   157   9.59E-15   386   402
7-nodes (Jackson)   2.54E-15  0   2.53E-15    42    16   2.33E-15    44    18   2.59E-15    36    10   2.34E-15    52   277
                              1   2.46E-15    28    12   2.50E-15    34    14   2.34E-15    26    11   2.47E-15    32   248
                              2   ?           27    12   2.41E-15    20    13   2.63E-15    25    15   2.42E-15    32   332
7-nodes (non-J.)    -         0   7.57E-15    54    56   7.96E-15   149    52   8.98E-15   135    88   8.55E-15   202  1305
                              1   7.40E-15    44    52   7.37E-15    92    45   7.46E-15   103    84   8.03E-15   142  1323
                              2   7.64E-15     ?    52   7.29E-15    79    52   8.31E-15    91   119   7.45E-15   126  1495

(? marks entries that are illegible in the source.)


code, we found that the queues are practically unbounded (implemented as 
32-bit integer counters), which we reproduce in the MODEST models for modes. 
— FIG by default uses the batch means technique for steady-state simulation, 
where a single run is partitioned into equal-duration batches, each of which 
provides one sample value. In communication with the original author, we 
found that each of their samples results from an independent run. We adapted 
FIG to do the same. It is the default in modes. (Semi-independent replication.) 
— We also found in this communication that the original runs perform no split- 
ting for the first 40 clients served; this part of the run is ignored as an initial 
transient phase. We confirmed this in the source code. We measured the av- 
erage time to serve 40 clients for each model and use the result as transient 
phase duration with FIG and modes since neither tool supports a transient 
phase based on clients served. (Semi-independent replication.) 
The original experiments were realised in a single file of C code that represents 
both the algorithm and the models, specialised to queueing models with trans- 
ition probabilities and service rates specified in constant arrays. In fact, the code 
we received implemented the 2-tandem queueing network only. We extended this 
code with a compile-time choice among the models described in [20], and fixed
a few small bugs. We thus have four sets of results to compare, shown in Table 1:
the original numbers given in [19,21], plus those from our new executions of the
adapted code, modes, and FIG. In the table, time is in seconds, p̂ is the estimate,
p is the true steady-state probability where it can be derived, and n is the number
of samples needed by the statistical evaluation. The adapted code and FIG
ran on an Intel Xeon E5-2630 v3 (2.4-3.2 GHz), and modes ran on a Core i7-4790 


378 C. E. Budde and A. Hartmanns 


(3.6-4.0 GHz, 4 physical/8 logical cores) system. The adapted code and FIG are 
single-threaded whereas modes used 7 simulation threads. The adapted code is 
tailor-made for these models, while FIG has to encode them in the more general
IOSA framework, making it slower; modes in turn profits from a special-case im- 
plementation for CTMC to speed up the Markovian cases. Comparing runtimes 
between tools is thus of limited use. The estimates are the centers of confidence 
intervals returned by the tools with confidence and relative width as described 
above. Each (estimate, n, time) triple was selected from 5 tool executions by pick- 
ing the one with the median runtime. We underline the best runtimes among 
values for j. However, the wide confidence intervals (except for 2-tandem), few 
executions, and in principle unsound stopping criterion that we reproduce from 
the original experiments mean that results, including best values of j, vary a lot 
for different random seeds. The original experimental setup is thus insufficient 
for drawing conclusions about the precise tradeoffs between specific values of j,
but may at most expose an overall trend. 

Nevertheless, our estimates are mostly within the margin of error around the 
original or true results. We confirm the main experimental conclusion of [19,21]: 
as j increases, n decreases, but from some point—mostly j > 1 or 2—runtime 
increases, due to the overhead of more retrials surviving longer. For the non- 
Jackson triple tandem network, none of our results matches the numbers of [19]. 
Since the original code, albeit adapted, agrees with FIG and modes rather than 
with the original results, we suspect an error in [19] or [20] w.r.t. this one model. 


4 Conclusion 


We demonstrated the extension of the FIG and modes rare event simulation tools 
to support prolonged retrials in rare event simulation using RESTART import- 
ance splitting. These implementations and experiments were the outcome of an 
exercise in independently replicating experimental research originally performed 
in mathematics, from a computer science perspective. We confirm the key find- 
ings of the earlier work. At the same time, we document several issues—small 
but critical technical details of the algorithm and experimental setup—where 
the publicly available information was insufficient for a completely independ- 
ent replication. We in particular noticed that replicating randomised/statistical 
algorithms poses a particular challenge since small errors may result in subtle 
mis-estimations that are often drowned in the overall statistical error. In the end, 
however, all issues could be resolved due to the exceptional support, respons- 
iveness, and openness of the original author, José Villén-Altamirano, whom we 
thank earnestly. However, such support cannot be expected for experimental 
work in general, in particular where temporary staff like Ph.D. students—who 
eventually graduate and move to new institutions or industry—perform the bulk 
of the experiments. This paper thus also highlights the need for computer science 
and the formal verification community to continue their push for artifact eval- 
uation and archived, publicly available reproduction packages. A reproduction 
package for our experiments is archived at DOI 10.6084/m9.figshare.12269462.
Abstract. Developing algorithms for distributed systems is an error-
prone task. Formal models like Petri nets with transits and Petri games 
can prevent errors when developing such algorithms. Petri nets with tran- 
sits allow us to follow the data flow between components in a distributed 
system. They can be model checked against specifications in LTL on both 
the local data flow and the global behavior. Petri games allow the synthe- 
sis of local controllers for distributed systems from safety specifications. 
Modeling problems in these formalisms requires defining extended Petri 
nets which can be cumbersome when performed textually. 

In this paper, we present a web interface¹ that allows an intuitive, visual
definition of Petri nets with transits and Petri games. The corresponding 
model checking and synthesis problems are solved directly on a server. 
In the interface, implementations, counterexamples, and all intermediate 
steps can be analyzed and simulated. Stepwise simulations and interac- 
tive state space generation support the user in detecting modeling errors. 


1 Introduction 


Distributed systems consist of several individual components. Each component 
has incomplete information about the other components. Asynchronous dis- 
tributed systems have no fixed rate at which components progress but rather each 
component progresses at its individual rate between synchronizations with other 
components. Implementing correct algorithms for asynchronous distributed sys- 
tems is difficult because they have to both work with the incomplete information 
of the components and for every possible scheduling between the components. 
Petri nets [22,21] are a natural model for asynchronous distributed systems. 
Tokens represent components and transitions with more than one token corre- 
spond to synchronizations between the components. Petri nets with transits [9] 
extend Petri nets with a transit relation to model the data flow in asynchronous 
Grant Petri Games (392735815) and through the Collaborative Research Center
“Foundations of Perspicuous Software Systems” (TRR 248, 389792660), and by the
distributed systems. Flow-LTL [9] is a specification language for Petri nets with
transits and allows us to specify linear properties on both the global and the 
local view of the system. In particular, it is possible to globally select desired 
runs of the system with LTL (e.g., only fair and maximal runs) and check the 
local data flow of only those runs again with LTL. A model checker for Petri 
nets with transits against Flow-LTL is implemented in the tool ADAMMC [10]. 
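
As a reminder of the underlying token game, the following is a minimal sketch of a plain Petri net in which a transition with two input places synchronizes two components. It deliberately omits the transit relation and Flow-LTL, and it is unrelated to the ADAMMC code base; the class, place, and transition names are made up for this illustration.

```python
from collections import Counter


class PetriNet:
    """Plain place/transition net; transitions map a name to
    (input places, output places)."""

    def __init__(self, transitions, marking):
        self.transitions = transitions
        self.marking = Counter(marking)

    def enabled(self, name):
        """A transition is enabled if every input place holds enough tokens."""
        pre, _ = self.transitions[name]
        return all(self.marking[p] >= k for p, k in Counter(pre).items())

    def fire(self, name):
        """Consume tokens from the input places, produce on the output places."""
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        pre, post = self.transitions[name]
        self.marking.subtract(Counter(pre))
        self.marking.update(Counter(post))


# Two components A and B: A progresses on its own, then both synchronize.
net = PetriNet(
    transitions={
        "a_local": (["a0"], ["a1"]),
        "sync": (["a1", "b0"], ["a2", "b1"]),
    },
    marking={"a0": 1, "b0": 1},
)
sync_enabled_initially = net.enabled("sync")  # A has not reached a1 yet
net.fire("a_local")
net.fire("sync")
```

The transition `sync` can only fire once both tokens are present, which mirrors how transitions with more than one token model synchronizations between components.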
Petri games [14] define the synthesis of asynchronous distributed systems 
based on Petri nets and causal memory. With causal memory, players exchange 
their entire causal past only upon synchronization. Without synchronization, 
players have no information of each other. For safety winning conditions, the 
synthesis algorithm for Petri games with a bounded number of controllable com-
ponents and one uncontrollable component is implemented in ADAMSYNT [12]².
Both tools are command-line tools lacking visual support to model Petri nets 
with transits or Petri games and the possibility to simulate or interactively ex- 
plore implementations, counterexamples, and parts of the created state space. 
In this paper, we present a web interface³ for model checking asynchronous
distributed systems with data flows and for the synthesis of asynchronous dis-
tributed systems with causal memory from safety specifications. The web inter-
face offers an input for Petri nets with transits and Petri games where the user
interactively creates places, transitions, and their connections with a few inputs. 
As a back-end, the algorithms of ADAMMC are used to model check Petri 
nets with transits against a given Flow-LTL formula as specification. Internally, 
the problem is reduced to the model checking problem of Petri nets against 
LTL. Both the input Petri net with transits and the constructed Petri net can
be visualized and simulated in the web interface. For a positive result, the web 
interface lets the user follow the control flow of the combined system and the data 
flow of the components. For a negative result, the web interface simulates the 
counterexample with a visual separation of the global and each local behavior. 
The algorithms of ADAMSYNT solve the given Petri game with safety specifi- 
cation. Internally, the problem is reduced to solving a finite two-player game with 
complete information. For a positive result, a winning strategy for the Petri game 
and the two-player game can be visualized and the former can be simulated. For 
a negative result, the web interface lets the user interactively construct strategies 
of the two-player game and highlights why they violate the specification. These
intuitive construction methods, interactive features, and visualizations
support the user when developing asynchronous distributed systems.


2 Web Interface for Petri Nets with Transits 


The web interface can model check Petri nets with transits against Flow-LTL. 
We use an example from software-defined networks to showcase the workflow. 


² ADAMSYNT was previously called ADAM. From now on, ADAMMC and
ADAMSYNT are combined in the tool ADAM (https://github.com/adamtool/adam).
³ The web interface is open source (https://github.com/adamtool/webinterface) and
a corresponding artifact to set it all up locally in a virtual machine is available [16].
[Figure: screenshot of the AdamWEB model checker. The left side shows the editor for the Petri net with transits and its tool buttons; the right side shows the model checking result "Not satisfied" for the formula (G u0 -> A F s2) together with a counterexample firing sequence.]

Fig. 1. Screenshot from the web interface for the model checking workflow.


Workflow for Petri Nets with Transits. One application domain for Petri
nets with transits is software-defined networks (SDNs) [20,4]. The nodes of
the network are switches which forward packets along the edges of the net- 
work according to the routing configuration. Packets enter the network at ingress 
switches and leave it at egress switches. SDNs separate the packet forwarding 
process, called the data plane, from the routing process, called the control plane. 
Concurrent updates to the routing configuration are difficult to get right [15]. 

The separation of data and control plane and updates to the routing con- 
figuration can be encoded into Petri nets with transits [9]. Using this encod- 
ing, we demonstrate the workflow of the web interface for model checking an 
asynchronous distributed system with data flows. The packets of the SDN are 
modeled by the data flow in the Petri net with transits. The data flow relation 
as an extension from Petri nets to Petri nets with transits is depicted as colored 
and labeled arcs. In Fig. 1, the web interface presents the resulting Petri net 
with transits M. First, we use the tools on the left to create for each switch a 
place si with i ∈ {0,…,5} and add a token (cf. outer parts of M). Then, we
create transitions for the connections between the switches and for the origin of 
packets in the SDN (cf. transition ingress in the top-left corner) and link them 
with flows in both directions. Additionally, we create local transits between the 
switches corresponding to the forwarding of packets. They are displayed in light 
blue and red and are identified by the letters. This constitutes the data plane. 

Next, we define the control plane, i.e., which forwarding is activated. Each 
transition to forward packets is connected to a place ai with i ∈ {0,…,5} which
has a token when the forwarding is configured initially (cf. places a3, a4, and a5)
and no token otherwise (cf. places a0, a1, and a2). For the concurrent update,
we create places ui with i ∈ {0,…,7} and transitions ti with i ∈ {6,…,11}
with corresponding flows (cf. inner parts of M).
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Transitions for the forwarding are set as weak fair, i.e., whenever a transition 
is infinitely long enabled in a run, it also has to fire infinitely often, indicated by 
the purple color of the outer transitions. Transitions for the update do not require 
fairness assumptions. A satisfied Flow-LTL formula is A F s5 specifying that all 
packets eventually reach switch s5. An unsatisfied formula is (G u0 → A F s2) 
requiring for runs, where the update is never executed, that all packets are taking 
the lower-left route. The fairness assumptions and a maximality assumption, i.e., 
whenever some transition can fire in a run some transition fires, are automatically 
added to the formula. In the screenshot, a counterexample for the unsatisfied 
formula is displayed on the right. The first packet takes the upper-right route 
via transitions t3, t4, and t5 and the update never starts. 


Features for Petri Nets with Transits. ADAMMC [10] is a command-line 
model checking tool for Petri nets with transits and Flow-LTL [9]. The model 
checking problem of Petri nets with transits against Flow-LTL is solved by a re- 
duction to Petri nets and LTL. The web interface allows displaying and arranging 
the nodes of the Petri net from the reduction and the input Petri net with tran- 
sits. Automatic layout techniques are applied to avoid the overlapping of nodes. 
A physics control, which modifies the repulsion, link, and gravity strength of 
nodes, can be used to minimize the overlapping of edges. Heuristics generate 
coordinates for the constructed Petri net by using the coordinates of the input 
Petri net with transits to obtain a similar layout of corresponding parts. 

For a positive result, the web interface allows visualizing the data flow trees 
for given firing sequences of the nets. For a negative result, the counterexample 
can be simulated both in the Petri net with transits and in the Petri net from 
the reduction. The witness of the counterexample for each flow subformula and 
the run violating the global behavior can be displayed by the web interface. This 
functionality is helpful when developing an encoding of a problem into Petri nets 
with transits to ensure that a counterexample is not an error in the encoding. 
The constructed Petri net can be exported into a standard format for Petri net 
model checking (PNML) and the constructed LTL formula can be displayed. 


3 Web Interface for Petri Games 


The web interface can synthesize local controllers from safety specifications. The 
workflow is showcased for a distributed alarm system given as a Petri game. 


Workflow for Petri Games. We demonstrate the workflow of the web interface 
for the synthesis of asynchronous distributed systems with causal memory from 
safety specifications. Petri games separate the places of an underlying Petri net 
into system places and environment places. Tokens on system places are system 
players and tokens on environment places are environment players. Each player 
has causal memory: only upon synchronization with other players, they exchange 
their entire causal past. For safety specifications, the system players have to avoid 
that a bad place is reached for all behaviors of the environment players. 
Fig. 2. Screenshot from the web interface for the synthesis workflow. 


We want to obtain two local controllers of a distributed alarm system that 
should indicate the location of a burglary at both controllers. In Fig. 2, the web 
interface presents the resulting Petri game on the left and the winning strategy 
for the alarm system on the right. The burglar is modeled by an environment 
player and each component of the distributed alarm system by a system player. 
Environment players are on white places and system players on gray ones. We 
create five environment places e0, e1, e2, eL, and eR. The place e0 has a token, 
e1 and e2 serve for the decision to burgle a location, and eL and eR for actually 
burgling the location. Each component x ∈ {p,q} of the alarm system has one 
system place x0 with a token, two system places x1 and x2 to detect a burglary 
and inform the other component, and two system places xL and xR to sound 
an alarm with the position of a burglary. We create rows of transitions for the 
environment player deciding where to burgle (first row), for the components de- 
tecting a burglary (second row), for the communication between the components 
(third row), and for sounding the alarm at each location (fourth row). 

Finally, we use transitions fai with i ∈ {0,...,3} and frj with j ∈ {0,...,7} 
connected to the bad place bad to define that the implementation of the dis- 
tributed alarm system should avoid false alarms and false reports. A false alarm 
occurs if the burglar did not burgle any location but an alarm occurred, i.e., in 
every pair of places {e0} × {pL, pR, qL, qR}. A false report occurs if a burglary 
happened at a location but a component of the alarm system indicates a burglary 
at the other location, i.e., in every pair of places {e1, eL} × {pR, qR} and 
{e2, eR} × {pL, qL}. We add transitions and flows to bad for these cases. 
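
The bad situations described above are Cartesian products of place sets, so they can be enumerated mechanically. The following sketch only uses the place names from the text; the variable names are ours. The resulting counts match the index ranges of the transitions fai (four false alarms) and frj (eight false reports).

```python
from itertools import product

alarm_places = ["pL", "pR", "qL", "qR"]

# False alarm: the burglar has not chosen a location (token on e0),
# but one of the components sounds an alarm.
false_alarms = list(product(["e0"], alarm_places))

# False report: a burglary is decided or happening at one location,
# but a component sounds the alarm for the other location.
false_reports = (
    list(product(["e1", "eL"], ["pR", "qR"]))
    + list(product(["e2", "eR"], ["pL", "qL"]))
)

for env_place, sys_place in false_alarms + false_reports:
    print(f"bad if both {env_place} and {sys_place} are marked")
```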

The web interface finds a winning strategy (depicted on the right in Fig. 2) 
for the Petri game described above. Each component locally monitors its location 
(t2, t3) and simultaneously waits for information about a burglary at the other 
location (t4, t5). When a burglary is detected at the location of the component 
then it first informs the other component (t4, t5) and then outputs an alarm for 
the current location (t7, t8). When a component is informed about a burglary 
at the other location, it outputs an alarm for the other location (t6, t9). 


Features for Petri Games. ADAMSYNT [12] is a command-line tool for Petri 
games [14]. The synthesis problem for Petri games with a bounded number of 
system players, one environment player, and a safety objective is reduced to the 
synthesis problem for two-player games. A winning strategy in the two-player 
game is translated into a winning strategy for the Petri game. Both can be vi- 
sualized in the web interface. Here, the web interface provides the same features 
for visualizing, manipulating, and automatically laying out the elements as for 
model checking. It uses the order of nodes of the Petri game to heuristically pro- 
vide a positioning of the strategy and allows simulating runs of the strategy. The 
winning strategy of the two-player game provides an additional view on the im- 
plementation to check if it is not bogus due to a forgotten case in the Petri game 
or specification. For an unrealizable synthesis problem, the web interface allows 
analyzing the underlying two-player game via a stepwise creation of strategies. 
This guides the user towards changes to make the problem realizable. 


4 Implementation Details 


The server is implemented using the Sparkjava micro-framework [23] for incom- 
ing HTTP and WebSocket connections. The client is a single-page application 
written in JavaScript using Vue.js [25], D3 [5], and the Vuetify component library [26]. 
We constructed libraries out of the tools ADAMMC and ADAMSYNT 
and implemented one interface handling both libraries. Common features like the 
physics control of nodes share the same implementation. All components of the 
libraries and the web interface [2] are open source and available on GitHub [1]. 


5 Conclusion 


We presented a web interface for two tools: ADAMMC, a model checker for data 
flows in asynchronous distributed systems represented by Petri nets with transits, 
and ADAMSYNT, a synthesis tool for local controllers from safety specifications 
in asynchronous distributed systems with causal memory represented by Petri 
games. The web interface makes the modeling and debugging of Petri nets with 
transits and Petri games user-friendly as it presents visual representations of 
the input, all intermediate steps, and the output of the tools. The interactive 
features are a great assistance for correctly modeling distributed systems. 

We plan to extend the web interface and tool support to model checking 
Petri nets with transits against Flow-CTL* [11], to other classes of Petri games 
with a decidable synthesis problem [13,3], to the bounded synthesis approach for 
Petri games [7,8,19,18], and to high-level Petri games [17]. As our web interface 
is open source and easy to extend, we also plan to connect it to other tools for 
Petri nets like APT [24], LoLA [27], or TAPAAL [6]. 
Abstract. JANI-model [6] is a model interchange format for networks 
of interacting automata. It is well-entrenched in the quantitative model 
checking community and allows modeling a variety of systems involving 
concurrency, probabilistic and real-time aspects, as well as continuous 
dynamics. Python is a general purpose programming language preferred 
by many for its ease of use and vast ecosystem. In this paper, we present 
Momba, a flexible Python framework for dealing with formal models cen- 
tered around the JANI-model format and formalism. Momba strives to 
deliver an integrated and intuitive experience for experimenting with for- 
mal models making them accessible to a broader audience. To this end, 
it provides a pythonic interface for model construction, validation, and 
analysis. Here, we demonstrate these capabilities. 


1 Introduction 


Dealing with formal models encompasses a variety of tasks which can be chal- 
lenging from time to time—especially for newcomers. Everything starts with 
the construction of a model or a family thereof. Often a textual or other, more 
formal, description of the scenario to be modeled already exists, such as a 
rough sketch of the desired behavior or a circuit diagram. Then, after a formal 
model has finally been conceived, one has to validate that the model actually 
adequately models what should be modeled. In this regard, models are just like 
any other human artifact: inadequate initially, but they get better over time. Only 
after confidence in the model has been established, one is able to harvest the 
benefits by handing over the model to analysis tools, e. g., a model checker. 

In this paper, we present Momba, a flexible Python framework for dealing 
with formal models. Momba strives to deliver an integrated and intuitive ex- 
perience to aid the process of model construction, validation, and analysis. It 
provides convenience functions for the construction of models, effectively turning 
Python into a syntax-aware macro language enabling the construction of 
models in a modular fashion. Momba’s built-in simulation engine allows gaining 
confidence in a model, for instance, by rapidly prototyping a tool for interactive 
model exploration and visualization, or by connecting it to a testing framework. 
Finally, thanks to the JANI-model [6] interchange format, several state-of-the- 
art model checkers and other tools are readily available for analysis. The latest 
version of Momba is always available on GitHub [1] and the evaluated artifact 
of this tool demo paper can be found on Zenodo [27]. 


Why Momba? The idea to harvest a general purpose programming environment 
for formal modelling is not new at all. For instance, the SVL language combines 
the power of process algebraic modelling with the power of the Bourne 
shell. As part of many CADP installations [12,13], it has been in daily use since its 
inception [11]. Many formal modeling tools also already provide Python bindings 
[23,10]. Momba tries not to be yet another incarnation of these ideas. 

While the construction of formal models clearly is an integral part of Momba, 
Momba is more than just a framework for constructing models with the help of 
Python. Most importantly, it also provides features to work with these models 
such as a simulator or an interface to different model checking tools. At the same 
time, it is not just a binding to an API developed for another language, say C++. 
Momba is tool-agnostic and aims to provide a pythonic interface for dealing with 
formal models while leveraging existing tools. Momba covers the whole process 
from model creation through validation to analysis. To this end, it is centered 
around the well-entrenched JANI-model interchange format. 


Why JANI? Traditionally, most analysis tools for formal models came with 
their own modeling languages and formats. The resulting fragmentation hindered 
interoperability between and comparability across different tools. JANI-model 
[6] has been conceived with the vision to put an end to this fragmentation. It 
has since been adopted by many quantitative model checkers [20,21,9] while for 
others translators have been developed [20,9] enabling cross-tool comparability 
and fostering competition within the community [22,19,7]. Recently, JANI has 
also been discovered by the planning community [24,25]. 

Momba supports all features of the JANI-model specification and some of its 
optional extensions. JANI is the natural foundation for a project like Momba. It 
provides a solid, well-established, and powerful modeling formalism for a variety 
of different kinds of systems involving concurrency, probabilistic and real-time 
aspects, as well as continuous dynamics. A JANI model is a network of interacting 
automata with variables. Attached to a model one can also specify various kinds 
of probabilistic and timed properties which can then be checked by several model 
checkers, e. g., ePMC [20], The Modest Toolset [21], and Storm [23]. The broad 
tool support for JANI models enables us to build upon existing research and to 
outsource computation-intensive tasks via unified interfaces. 


Why Python? Python is a popular high-level programming language, preferred 
by many for its ease of use and ecosystem. Especially within the data-science 
community, Python is the go-to language for data analysis and machine learning, 
leveraging tools such as TensorFlow [2] and scikit-learn [29]. Around these 
tools, scientific general purpose tools such as Jupyter [26] have emerged. Jupyter 
provides a platform for documenting scientific experiments and their results in 
a reproducible way combining code, data, and documentation. 

Our vision is to harvest Python’s ecosystem and the tools developed by the 
scientific community for dealing with formal models. Imagine a Jupyter notebook 
documenting a model, including the code to construct it, with interactive 
visualizations of the model itself and various analysis results. 

By basing our efforts on a popular language that is appreciated by scientists 
and established in the scientific community, we hope to lower the entry barrier, 
especially for those outside the formal methods community. 


The User Perspective. In what follows, we demonstrate multiple facets of Momba 
using a variant of Racetrack, a well-known benchmark in autonomous AI decision 
making [4,31] which has recently found its use in several model checking contexts 
[16,3,15], too. We go through the entire process from the construction of a family 
of models through their validation to their analysis. For each step, we highlight 
what Momba has to offer in terms of effectively supporting the process. 

Originally, Racetrack was a pen-and-paper game [14]. A track is a two- 
dimensional grid comprising start, goal, wall, and blank cells (cf. Fig. 1) [4]. A 
vehicle starts off with some initial velocity from a start cell, with the objective 
to reach a goal cell as fast as possible without crashing into a wall. The vehi- 
cle is controlled by nine possible actions modifying the current velocity vector. 
Racetrack naturally lends itself as a benchmark for sequential decision making 
in risky scenarios, in particular, when extended with probabilistic noise. In a 
variety of such noisy forms, it has been adopted as a benchmark for Markov 
Decision Process (MDP) algorithms in the AI community [4,5,28,30,31]. 

For our demonstration, we consider multiple variants of Racetrack giving rise 
to a family of MDPs, studied recently [3] from a feature-oriented perspective [8]. 
For example, there are different tank options and fuel is consumed according to 
various consumption models. In addition, there are different undergrounds induc- 
ing probabilistic noise modeling slippery road conditions. Clearly, this modeling 
scenario is beyond what is possible with mere model parametrization, especially 
so because we are interested in the car’s performance on different tracks each 
inducing its own MDP [4]. 


2 Scenario-Based Model Construction 


Usually, formal models are not constructed out of thin air but based on some 
kind of scenario description existing upfront. Such descriptions usually comprise 
an operational characterization of the behavior to model together with additional 
and sometimes more formal information about the specific case. Our use case is 
no exception: here, a textual description of the behavior of the car is provided 
together with a specific track and a specification of the variant. 

Naturally, Python can be used to nicely capture the formal parts of a sce- 
nario description in various data structures. Combined with a domain-specific 
parser for configuration files, scenario descriptions are interchangeable and easy 
to interface with the code for model construction. In our case, a textual represen- 
tation of the track (cf. Fig. 1) [4] is provided and parsed together with additional 
dim: 12 35 

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXØEE 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX... 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX... 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX... 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX... 


Fig. 1. Textual representation (left) and picture of a track (right): start cells in blue 
(s), goal cells in green (g), and wall cells marked with x. 


parameters, like the size of the tank and the type of the underground, into a 
data structure tailored to that purpose. 
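
As a rough illustration of such a parser, the following sketch reads a track given as a "dim:" header followed by one line per row. It assumes the cell symbols named in the caption of Fig. 1 (s for start, g for goal, x for wall) and treats every other character as a blank cell; the names Track and parse_track are hypothetical and not taken from racetrack.py.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Cell = Tuple[int, int]  # (row, column)


@dataclass(frozen=True)
class Track:
    height: int
    width: int
    start: FrozenSet[Cell]
    goal: FrozenSet[Cell]
    walls: FrozenSet[Cell]


def parse_track(text: str) -> Track:
    """Parse a 'dim: <rows> <columns>' header followed by one line per row."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    if not header.startswith("dim:"):
        raise ValueError("missing 'dim:' header")
    height, width = (int(n) for n in header[len("dim:"):].split())
    if len(rows) != height or any(len(row) != width for row in rows):
        raise ValueError("track does not match the declared dimensions")

    def cells(symbol: str) -> FrozenSet[Cell]:
        # Collect the coordinates of all cells carrying the given symbol.
        return frozenset(
            (r, c)
            for r, row in enumerate(rows)
            for c, ch in enumerate(row)
            if ch.lower() == symbol
        )

    return Track(height, width, cells("s"), cells("g"), cells("x"))


# A small made-up track, not the one shown in Fig. 1.
example = """
dim: 3 5
xxxgg
s...x
xxxxx
"""
track = parse_track(example)
```

Keeping the parsed track in an immutable data structure makes it easy to pass the same scenario description to model construction, visualization, and analysis code.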

Now, how does Momba support the construction of models from such data 
structures? A distinguishing feature of Momba is that it effectively turns Python 
into a syntax-aware macro language enabling the modular construction of models. 
For our Racetrack use case different fuel consumption models can be captured 
as macros from JANI expressions to JANI expressions: 


linear = lambda dx, dy: expr("abs($dx) + abs($dy)", dx=dx, dy=dy) 
quadratic = lambda dx, dy: expr("$linear ** 2", linear=linear(dx, dy)) 


A macro is simply a Python function. Upon execution, these macros construct 
JANI expressions using a straightforward syntax inspired by Python expressions. 
In this case, both functions take expressions for the current velocity of the vehicle 
in x and y dimension and return an expression for the resulting fuel consumption 
which is either linear or quadratic in the velocity. In contrast to how macros work 
in languages like C, syntax-aware macros using Momba’s expr function prevent 
surprises from mere text-based expansion. Being Python functions, macros can 
be easily passed around and used elsewhere: 

assignments = {
    "fuel": expr(
        "min(TANK_SIZE, max(0, fuel - floor($consumption)))",
        consumption=fuel_model(car_dx, car_dy),
    )
}



Here, we update the fuel level by taking whatever macro has been provided 
for computing the fuel consumption. This code is part of constructing an edge 
for the tank automaton in a modular fashion in the sense that the consump- 
tion model is exchangeable. Momba provides further functions, for instance, for 
declaring variables, like fuel, and constructing automata, networks, as well as 
other model objects. Most of these functions provide all kinds of comforts, for 
instance, directly checking the types of the involved expressions. 
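
The difference between text-based and syntax-aware expansion can be illustrated without Momba. The sketch below uses only the Python standard library (string.Template and the ast module, Python 3.9 or later) on Python expressions rather than JANI expressions; text_macro and syntax_macro are hypothetical helper names and do not reflect how Momba's expr function is implemented.

```python
import ast
from string import Template


def text_macro(template: str, **parts: str) -> str:
    """C-style expansion: paste the text of each part into the template."""
    return Template(template).substitute(parts)


def syntax_macro(template: str, **parts: str) -> str:
    """Syntax-aware expansion: parse first, then splice whole subtrees."""
    # Replace every $name by a placeholder identifier so the template parses.
    placeholders = {name: f"__{name}__" for name in parts}
    tree = ast.parse(Template(template).substitute(placeholders), mode="eval")
    subtrees = {
        placeholders[name]: ast.parse(source, mode="eval").body
        for name, source in parts.items()
    }

    class Splice(ast.NodeTransformer):
        def visit_Name(self, node: ast.Name) -> ast.AST:
            # Substitute placeholders by complete expression trees.
            return subtrees.get(node.id, node)

    return ast.unparse(Splice().visit(tree).body)


linear = "abs(dx) + abs(dy)"
text_result = text_macro("$linear ** 2", linear=linear)
syntax_result = syntax_macro("$linear ** 2", linear=linear)

velocity = {"dx": 1, "dy": -2}
print(text_result, "=", eval(text_result, dict(velocity)))
print(syntax_result, "=", eval(syntax_result, dict(velocity)))
```

With textual pasting, the exponent binds only to abs(dy), so the quadratic model silently computes something else; splicing the expression as a subtree keeps the intended grouping.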

Using syntax-aware macros and Momba’s other convenience functions, we 
arrive at a Python script racetrack.py [27] generating a collection of JANI 
models from scenario descriptions comprising a track and specifying a variant. 
Iterating over possible scenario descriptions, hundreds of JANI models can be 
generated fully automatically and consequently be analyzed. 
3 Validation by Simulation 


Having our models ready, we have to somehow gain confidence that they actually 
model what we want them to, before handing them over to analysis tools. One 
way of gaining confidence in a model is by simulating its behavior and manually 
checking it for consistency with one's own understanding of what the model should 
do. Just like any kind of debugging, this can be a tedious and frustrating process, 
especially with text-based traces generated by some generic simulator. Momba 
instead comprises a built-in simulation engine, enabling rapid development of 
interactive visualizations. This effectively allows us to steer a vehicle through 
a track thereby exploring a model’s behavior, testing edge cases as in a racing 
game, and ultimately gaining confidence in the model. 


Momba’s built-in simulation engine supports the simulation of a variety of 
different JANI models including timed models. It has been written completely 
from scratch with easy accessibility from Python in mind. Non-determinism can 
be resolved by uniform random sampling or by querying an external oracle such 
as, in the case of our interactive visualization, the user, a testing framework, 
or even a neural network as done for DSMC [16]. For each step, the simulator 
provides all the necessary information like the binding of variables to values, 
the locations the various automata of a network are in, and the possible actions 
(and time delays for timed models) that can be taken. This information can then 
be extracted and used to display whatever is of interest for understanding and 
investigating the behavior of the model under scrutiny. 
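
The idea of resolving non-determinism through an exchangeable oracle can be sketched in a few lines of plain Python. This is not Momba's simulation API: State, step, and simulate are hypothetical names, the vehicle dynamics ignore walls and noise, and the linear fuel consumption is an assumption made only to keep the example small.

```python
import random
from typing import Callable, List, NamedTuple, Sequence, Tuple

Action = Tuple[int, int]  # acceleration in x and y direction


class State(NamedTuple):
    x: int
    y: int
    dx: int
    dy: int
    fuel: int


# Nine actions: each velocity component changes by -1, 0, or +1.
ACTIONS: List[Action] = [(ax, ay) for ax in (-1, 0, 1) for ay in (-1, 0, 1)]

Oracle = Callable[[State, Sequence[Action]], Action]


def step(state: State, action: Action) -> State:
    """Apply an acceleration, move the vehicle, and consume fuel."""
    dx, dy = state.dx + action[0], state.dy + action[1]
    fuel = max(0, state.fuel - (abs(dx) + abs(dy)))
    return State(state.x + dx, state.y + dy, dx, dy, fuel)


def uniform_oracle(state: State, actions: Sequence[Action]) -> Action:
    """Resolve non-determinism by uniform random sampling."""
    return random.choice(list(actions))


def simulate(initial: State, oracle: Oracle, steps: int) -> List[State]:
    """Run the model for a number of steps, asking the oracle at each one."""
    trace = [initial]
    for _ in range(steps):
        current = trace[-1]
        trace.append(step(current, oracle(current, ACTIONS)))
    return trace


# A scripted oracle stands in for a user, a test, or a learned policy.
always_right: Oracle = lambda state, actions: (1, 0)
trace = simulate(State(0, 0, 0, 0, 10), always_right, steps=3)
```

Because the oracle is just a function, the same loop serves interactive exploration (ask the user), automated tests (scripted choices), and statistical analysis (random or learned choices).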


Fig. 2 shows a simple interactive visualization of the Racetrack example based 
on Momba’s simulation engine where the user can steer the vehicle (indicated by 
the yellow asterisk) through the track by entering acceleration values. Certainly, 
there is ample room for beautification of this simulator (see TraceVis [15] for 
example) but for rapid model development this is not needed. After playing 
around with the interactive simulation for a while and testing various edge cases, 
we are confident that the model is adequate. 


dx: 2, dy: -1, fuel: 312 


Please choose an acceleration in x direction in range [-2, 2]: 2 
Please choose an acceleration in y direction in range [-2, 2]: @ 


Fig. 2. Interactive visualization using Momba’s simulation engine. 
4 Harvesting the Benefits 


Having constructed the models and gained confidence in their adequacy, we are 
now ready to harvest the benefits of formal modeling and to apply various state- 
of-the-art analysis tools, exploiting the JANI-model interchange. Again, Momba 
provides the necessary functions to define properties and hand our models, with 
the respective properties attached to them, over to common analysis tools. 
Imagine that we are interested in the property Pmax(◇ (on_goal ∧ fuel > 0)), 
i.e., the maximal probability of reaching a goal cell with a non-empty tank from 
a given start cell. Using Momba’s syntax-aware macros, we first construct a 
disjunction over all goal cells and then define the property using the concise 
syntax provided by Momba’s prop function: 
on_goal = reduce(lor, (expr("car_pos == $g", g=g) for g in goal_cells), False) 
define_property ( 
prop("min({ Pmax(F($on_goal and fuel > 0)) | initial })", on_goal=on_goal), 
name="goalProbabilityFuel", 
) 
After generating a model with the vehicle starting from position (0,7) on the 
track depicted in Fig. 1 and with sand as underground, the value iteration engine 
mcsta [18] of The Modest Toolset calculates a probability of 87.5% taking 153s 
when invoked by Momba with the model. Momba also cross-checks the results 
for us, by invoking Storm’s dd engine [9] (the fastest engine for this model) and 
obtains the same result in 107s. These experiments have been carried out on a 
standard laptop with an Intel Core i7 at 2.7 GHz. 
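
To give an intuition for what a maximal reachability probability is, the following sketch runs a textbook value iteration on a tiny hand-made MDP. It is not how mcsta or Storm are implemented, the MDP and its numbers are invented for illustration, and the simple convergence threshold used here is not a sound stopping criterion in general.

```python
from typing import Dict, List, Set, Tuple

# state -> action -> list of (probability, successor)
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]


def max_reach_probability(mdp: MDP, goal: Set[str], epsilon: float = 1e-9) -> Dict[str, float]:
    """Approximate Pmax(F goal) for every state by value iteration."""
    values = {state: (1.0 if state in goal else 0.0) for state in mdp}
    while True:
        delta = 0.0
        for state, actions in mdp.items():
            if state in goal or not actions:
                continue  # goal states stay at 1, deadlock states at 0
            # Pick the action maximizing the expected value of the successors.
            best = max(
                sum(prob * values[succ] for prob, succ in branches)
                for branches in actions.values()
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < epsilon:
            return values


# Invented example: a risky shortcut versus a slower but safer route.
toy: MDP = {
    "start": {
        "fast": [(0.5, "goal"), (0.5, "crash")],
        "slow": [(1.0, "curve")],
    },
    "curve": {"go": [(0.9, "goal"), (0.1, "crash")]},
    "goal": {},
    "crash": {},
}
probabilities = max_reach_probability(toy, goal={"goal"})
```

In the toy example the slower route is optimal, so the maximal probability of reaching the goal from the start state is 0.9.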


5 Conclusion 


We presented Momba, a Python framework for dealing with quantitative models 
covering the whole process of model creation, validation, and analysis provid- 
ing an integrated and intuitive experience. In a user story on Racetrack, we 
demonstrated how Momba’s capabilities can be used throughout all stages of 
the development process of cyber-physical models. 

We demonstrated how Momba enables scenario-based model construction 
with Python code in a concise and modular way with syntax-aware macros. Using 
Momba’s simulation engine, we were able to rapidly prototype an interactive 
visualization thereby gaining confidence in our models and, finally, thanks to 
JANI-model, we demonstrated how to analyse our models with state-of-the-art 
model checkers directly invoked and cross-checked by Momba. 

By basing Momba on Python, we aim to harvest the tools developed by the 
data-science community. Especially, when combined with Jupyter [26], Momba 
enables literate programming [32] combining code, data, and documentation for 
reproducible experiments and process documentation. 

We hope that Momba helps to open up the world of formal modeling towards 
a broader community by lowering or removing barriers otherwise obstructing the 
application of formal models. Momba’s infrastructure is implemented in such a 
way that it can easily be extended into other directions and for connections to 
other research areas, e. g., model checking policies machine learned with Python 
libraries [16,17]. 
Abstract. SV-COMP 2021 is the 10th edition of the Competition on 
Software Verification (SV-COMP), which is an annual comparative eval- 
uation of fully automatic software verifiers for C and Java programs. 
The competition provides a snapshot of the current state of the art in 
the area, and has a strong focus on reproducibility of its results. The 
competition was based on 15 201 verification tasks for C programs and 
473 verification tasks for Java programs. Each verification task consisted 
of a program and a property (reachability, memory safety, overflows, 
termination). SV-COMP 2021 had 30 participating verification systems 
from 27 teams from 11 countries. 
1 Introduction 


Among several other objectives, the Competition on Software Verification (SV- 
COMP, https://sv-comp.sosy-lab.org/2021) showcases the state of the art in the 
area of automatic software verification. This edition of SV-COMP is already the 
10th edition of the competition and presents again an overview of the currently 
achieved results by tool implementations that are based on the most recent ideas, 
concepts, and algorithms for fully automatic verification. This competition report 
describes the (updated) rules and definitions, presents the competition results, 
and discusses some interesting facts about the execution of the competition 
experiments. The objectives of the competitions were discussed earlier (1-4 [16]) 
and extended over the years (5-6 [17]): 


1. provide an overview of the state of the art in software-verification technology 
and increase visibility of the most recent software verifiers, 

2. establish a repository of software-verification tasks that is publicly available 
for free use as standard benchmark suite for evaluating verification software, 


This report extends previous reports on SV-COMP [10, 11, 12, 13, 14, 15, 16, 17]. 
Reproduction packages are available on Zenodo (see Table 4). 
Funded in part by the Deutsche Forschungsgemeinschaft (DFG) — 378803395 (ConVeY). 
dirk.beyer@sosy-lab.org 

© The Author(s) 2021 


J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 401-422, 2021. 
https://doi.org/10.1007/978-3-030-72013-1_24 
3. establish standards that make it possible to compare different verification 
tools, including a property language and formats for the results, 

4. accelerate the transfer of new verification technology to industrial practice 
by identifying the strengths of the various verifiers on a diverse set of tasks, 

5. educate PhD students and others on performing reproducible benchmarking, 
packaging tools, and running robust and accurate research experiments, and 

6. provide research teams that do not have sufficient computing resources with 
the opportunity to obtain experimental results on large benchmark sets. 


The previous report [17] discusses the outcome of the SV-COMP competition 
so far with respect to these objectives. 


Related Competitions. Competitions are an important evaluation method 
and there are many competitions in the field of formal methods. We refer to 
the previous report [17] for a more detailed discussion and give here only the 
references to the most related competitions [9, 19, 55, 56]. 


Quick Summary of Changes. We strive to continuously improve the compe- 
tition, and this report describes the changes of the last year. In the following 
we list a brief summary of new items in SV-COMP 2021: 


• SPDX identification of licenses in SV-Benchmarks collection 

• WitnessLint: New checker for syntactical validity of verification witnesses 

• Upgrade of the task-definition format to version 2.0 

• Addition of several verification tasks and whole new sub-categories to the 
  SV-Benchmarks collection 

• Elimination of competition-specific functions __VERIFIER_error and 
  __VERIFIER_assume from the verification tasks (and rules) 

• Change in scoring schema: Unconfirmed results not counted anymore (when 
  validation was applied) 

• CoVeriTeam: New tool that can be used to remotely execute verification 
  runs on the competition machines 

• Automatic participation of previous verifiers 


2 Organization, Definitions, Formats, and Rules 


Procedure. The overall organization of the competition did not change in comparison 
to the earlier editions [10, 11, 12, 13, 14, 15, 16, 17]. SV-COMP is an open 
competition (also known as comparative evaluation), where all verification tasks 
are known before the submission of the participating verifiers, which is necessary 
due to the complexity of the C language. The procedure is partitioned into the 
benchmark submission phase, the training phase, and the evaluation phase. The 
participants received the results of their verifier continuously via e-mail (for 
pre-runs and the final competition run), and the results were publicly announced 
on the competition web site after the teams inspected them. The Competition 
Jury oversees the process and consists of the competition chair and one member 
of each participating team. Team representatives of the jury are listed in Table 5. 
Table 1: Tools for witness-based result validation (validators) and witness linter 

Validator      References     Represent./Developer   Affiliation 

CPAchecker     [22, 23, 25]   Martin Spiessl         LMU Munich, Germany 
UAutomizer     [22, 23]       Daniel Dietsch         Uni Freiburg, Germany 
CPA-w2t        [24]           Thomas Lemberger       LMU Munich, Germany 
FShell-w2t     [24]           Michael Tautschnig     Queen Mary U. of London, UK 
Nitwit         [78]           Philipp Berger         RWTH Aachen, Germany 
MetaVal        [29]           Martin Spiessl         LMU Munich, Germany 
WitnessLint                   Sven Umbricht          LMU Munich, Germany 


License Requirements. Starting in 2018, SV-COMP has required that the verifier 
be publicly available for download and have a license that 


(i) allows reproduction and evaluation by anybody (incl. results publication), 
(ii) does not restrict the usage of the verifier output (log files, witnesses), and 
(iii) allows any kind of (re-)distribution of the unmodified verifier archive. 


During the qualification phase, when the jury members inspected the verifier
archives, several issues with licenses (missing licenses, incompatibilities) were
detected, which the developers were able to address in time.

With SV-COMP 2021, the community started the process of making the 
benchmark collection REUSE compliant (https://reuse.software) by adding SPDX 
license identifiers (https://spdx.dev). A few directories are properly labeled al- 
ready, and continuous-integration checks with REUSE ensure that new con- 
tributions adhere to the standard. 


Validation of Results. This time, the validation of the verification results was 
done by seven validation tools, which are listed in Table 1, including references to 
literature. The validators CPACHECKER and UAUTOMIZER have supported the competition
since the beginning of its result validation in 2015. Execution-based validation was
added in 2018 using CPA-w2t and FSHELL-w2t. Two new validators have participated
since the previous SV-COMP in 2020: NITWIT and METAVAL. A few categories
were still excluded from validation because no validators were available for 
some types of programs or properties. 

For SV-COMP 2021, the new validator WITNESSLINT was added for validating
witnesses regarding their syntax. It checks the witnesses produced by
the verification tools against the specification of the format for verification
witnesses (https://github.com/sosy-lab/sv-witnesses/tree/svcomp21). For example,
WITNESSLINT ensures that a verification witness is a proper XML/GraphML
file and contains the required meta data. This means that the validators can 
focus on the validation of the verification result, assuming that the verification 
witness is syntactically valid. If the witness linter deems a verification witness 
as syntactically invalid, then the answers of the result validators are ignored 
and the result is not counted as confirmed. 
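
The kind of syntactic check described above can be illustrated with a minimal sketch. It is not the implementation of WITNESSLINT: the function names are hypothetical, and the set of required metadata keys is an assumed subset chosen for illustration; the authoritative list is given by the witness-format specification.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"
# Assumed subset of required metadata keys, for illustration only.
REQUIRED_KEYS = {"witness-type", "producer", "specification", "programfile"}


def lint_witness(text):
    """Return a list of syntax problems; an empty list means the check passed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    if root.tag != f"{{{GRAPHML_NS}}}graphml":
        return ["root element is not <graphml>"]
    graph = root.find(f"{{{GRAPHML_NS}}}graph")
    if graph is None:
        return ["missing <graph> element"]
    present = {d.get("key") for d in graph.findall(f"{{{GRAPHML_NS}}}data")}
    return [f"missing metadata: {key}" for key in sorted(REQUIRED_KEYS - present)]


def result_counts_as_confirmed(linter_problems, validator_confirmed):
    """A syntactically invalid witness overrides any validator answer."""
    return not linter_problems and validator_confirmed
```

The second function mirrors the rule stated above: the answers of the result validators only matter if the linter accepts the witness.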


Task-Definition Format 2.0. The format for the task definitions in 
the SV-Benchmarks repository was recently extended to include a set of
options that can carry information from the verification task to the verification
tool. SV-COMP 2021 used the task-definition format in version 2.0
(https://gitlab.com/sosy-lab/benchmarking/task-definition-format/-/tree/2.0).
More details can be found in the report for Test-Comp 2021 [19].


Properties. Please see the 2015 competition report [13] for the definition of 
the properties and the property format. All specifications are available in the 
directory c/properties/ of the benchmark repository. 


Categories. The updated category structure is illustrated by Fig. 1. The 
categories are also listed in Tables 7 and 8, and described in detail 
on the competition web site (https://sv-comp.sosy-lab.org/2021/benchmarks.php).
Compared to the category structure for SV-COMP 2020, we added 
the sub-categories XCSP and Combinations to category ReachSafety, and 
the sub-categories DeviceDriversLinux64Large ReachSafety, uthash MemSafety, 
uthash NoOverflows, and uthash ReachSafety to category SoftwareSystems. 

Another effort was to integrate some of the Juliet benchmark tasks [31] 
into the SV-Benchmarks collection. We requested a license for the Juliet 
programs that properly clarifies the license terms also outside the USA. We 
thank our colleagues from NIST for releasing their Juliet benchmark (which 
is declared as public domain) under the Creative Commons license CC0-1.0 
(https://github.com/sosy-lab/sv-benchmarks/blob/svcomp21/LICENSES/CC0-1.0.txt).
SV-COMP 2021 used many verification tasks from Juliet, in particular
for the memory-safety properties CWE121 (stack-based buffer overflow),
CWE401 (memory leak), CWE415 (double free), CWE476 (null-pointer
dereference), and CWE590 (free memory that is not on the heap) (see
https://github.com/sosy-lab/sv-benchmarks/blob/svcomp21/c/MemSafety-Juliet.set).

All those new contributions to the benchmark collection led to a growth
of the number of verification tasks from 11052 in SV-COMP 2020 to 15201
in SV-COMP 2021. 


Verification Tasks. The previous verification tasks and competition rules used 
special definitions for the functions __VERIFIER_error and __VERIFIER_assume. 
These special definitions were found to be unintuitive and inconsistent with ex- 
pectations in the verification community, and repeatedly caused confusion among 
participants. A call of function __VERIFIER_error() was defined to never return. 
A call of function __VERIFIER_assume(p) was defined such that if expression p 
evaluates to false, then the function loops forever, otherwise the function returns 
without any side effects. This led to unintended interactions with other properties. 

We eliminated these two functions in two steps. In the first step, each 
function call was replaced by a C-code implementation of the intended behavior.
In most of the cases, __VERIFIER_error(); was replaced by the C code
reach_error(); abort();, where reach_error is a ‘normal’ function, i.e., one
whose interpretation follows the C standard [3].

Fig. 1: Category structure for SV-COMP 2021; category C-FalsificationOverall
contains all verification tasks of C-Overall without Termination; Java-Overall
contains all Java verification tasks; compared to SV-COMP 2020, there are two new
sub-categories in ReachSafety and four new sub-categories in SoftwareSystems


Table 2: Scoring schema for SV-COMP 2021 (new: no point for unconfirmed
correct results anymore)

Reported result   Points  Description

UNKNOWN              0    Failure to compute verification result
FALSE correct       +1    Violation of property in program was correctly found
                          and a validator confirmed the result based on a witness
FALSE incorrect    -16    Violation reported but property holds (false alarm)
TRUE correct        +2    Program correctly reported to satisfy property
                          and a validator confirmed the result based on a witness
TRUE incorrect     -32    Incorrect program reported as correct (wrong proof)


Fig. 2: Visualization of the scoring schema for the reachability property (adjusted
from a previous report [15])


Eliminating __VERIFIER_assume was more complicated: In some
tasks for property memory-cleanup, __VERIFIER_assume(p); was replaced
by the C code assume_cycle_if_not(p);, which is implemented
as if (!p) while(1);, while for other tasks, __VERIFIER_assume(p);
was replaced by assume_abort_if_not(p);, which is implemented as
if (!p) abort();. The solution nicely illustrates the problem of the special
semantics: Consider property memory-cleanup, which requires that all
allocated memory is deallocated before the program terminates. Here, the
desired behavior of a failing assume statement would be that the program
does not terminate (and does not unintendedly violate the memory-cleanup
property). Now consider property termination, which requires that every
path finally reaches the end of the program. Here, the desired behavior of a
failing assume statement would be that the program terminates (and does
not unintendedly violate the termination property).

In the second step, the specifications for functions __VERIFIER_error and 
__VERIFIER_assume were removed from the competition rules (because no such 
functions exist anymore in the SV-Benchmarks collection). 
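
The first elimination step can be pictured as a textual rewrite of the C sources. The following is a minimal sketch under stated assumptions: a regex-based replacement is used only for illustration (the source does not say how the rewrite was actually performed), the function name and the flag memory_cleanup_task are hypothetical, and the braces around the replacement are an addition of this sketch so that the result remains a single statement after an unbraced if.

```python
import re


def eliminate_special_functions(c_source, memory_cleanup_task=False):
    """Replace the two special functions by calls with ordinary C semantics."""
    # Tasks for property memory-cleanup get the looping variant; all other
    # tasks get the aborting variant.
    assume = "assume_cycle_if_not" if memory_cleanup_task else "assume_abort_if_not"
    rewritten = re.sub(
        r"__VERIFIER_error\s*\(\s*\)\s*;",
        "{ reach_error(); abort(); }",
        c_source,
    )
    return re.sub(r"__VERIFIER_assume\s*\(", assume + "(", rewritten)


# The helper bodies as given in the text; the signatures are assumptions.
HELPER_DEFINITIONS = {
    "assume_cycle_if_not": "void assume_cycle_if_not(int p) { if (!p) while(1); }",
    "assume_abort_if_not": "void assume_abort_if_not(int p) { if (!p) abort(); }",
}
```

For example, a call such as eliminate_special_functions("__VERIFIER_assume(x > 0);", memory_cleanup_task=True) selects the looping variant, which keeps a failing assumption from terminating the program with memory still allocated.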


Scoring Schema and Ranking. Table 2 provides an overview and Fig. 2 visually
illustrates the score assignment for the reachability property as an example.
The scoring schema was changed regarding the special rule for unconfirmed
correct results for expected result TRUE. There was a rule during the transitioning
phase to assign one point if the answer matches the expected result
but the witness was not confirmed. Now score points are only assigned if the
results got validated (or no validator was available).
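
The scoring schema of Table 2, including the new rule for unconfirmed results, can be written down as a small function. This is a sketch for illustration only: the function name and parameters are hypothetical, and the real scoring is computed by the competition infrastructure.

```python
def score(reported, expected, witness_confirmed, validator_available=True):
    """Score of one verification run.

    reported: "TRUE", "FALSE", or "UNKNOWN" (the verifier's answer)
    expected: True if the property holds for the program, False otherwise
    """
    if reported == "UNKNOWN":
        return 0
    if reported not in ("TRUE", "FALSE"):
        raise ValueError(f"unexpected result: {reported!r}")
    claims_true = reported == "TRUE"
    if claims_true != expected:
        # Wrong proof (-32) or false alarm (-16).
        return -32 if claims_true else -16
    if validator_available and not witness_confirmed:
        # New in SV-COMP 2021: no point for unconfirmed correct results.
        return 0
    return 2 if claims_true else 1
```

Under the earlier transitional rule, the call score("TRUE", True, witness_confirmed=False) would have yielded one point; with the schema above it yields zero.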

As in previous years, the rank of a verifier was decided based on the sum
of points (normalized for meta categories). In case of a tie, the rank was
decided based on success run time, which is the total CPU time over all
verification tasks for which the verifier reported a correct verification result.
Opt-out from Categories and Score Normalization for Meta Categories were
handled as described previously [11] (page 597).
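
The ranking rule with its tie-breaker amounts to a two-level sort. The sketch below uses hypothetical verifier names and numbers and does not model the score normalization for meta categories.

```python
def rank(verifiers):
    """Order verifiers by descending score; ties are broken by lower success run time.

    verifiers: list of (name, score, success_cpu_time_in_seconds) tuples
    """
    ordered = sorted(verifiers, key=lambda v: (-v[1], v[2]))
    return [name for name, _, _ in ordered]


# Hypothetical data: A and C are tied on score, C used less CPU time.
print(rank([("A", 100, 50.0), ("B", 120, 90.0), ("C", 100, 30.0)]))
```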


3 Reproducibility 


To allow independent reproduction of the SV-COMP results, we made all ma- 
jor components that were used in the competition available in public version- 
control repositories. An overview of the components that contribute to the 
reproducible setup of SV-COMP is provided in Fig. 3, and the details are given 
in Table 3. We refer to the SV-COMP 2016 report [14] for a description of 
all components of the SV-COMP organization. 

We have published the competition artifacts at Zenodo (see Table 4) to 
guarantee their long-term availability and immutability. These artifacts comprise 
the verification tasks, the competition results, the produced verification witnesses, 
and the BENCHEXEC package. The archive for the competition results includes the
raw results in BENCHEXEC’s XML exchange format, the log output of the verifiers 
and validators, and a mapping from file names to SHA-256 hashes. The hashes 
of the files are useful for validating the exact contents of a file, and accessing 
the files inside the archive that contains the verification witnesses. 
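
How such a mapping from file names to hashes can be used is sketched below. The layout of the mapping (a plain dictionary from file name to hexadecimal digest) and the function names are assumptions made for illustration; the actual file format inside the artifact may differ.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hexadecimal SHA-256 digest of the given file contents."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_hash(data: bytes, file_name: str, recorded: dict) -> bool:
    """Check the exact contents of a file against the recorded hash."""
    return recorded.get(file_name) == sha256_of(data)


# Hypothetical mapping with a single entry for an empty file.
recorded = {"witness.graphml": sha256_of(b"")}
print(matches_recorded_hash(b"", "witness.graphml", recorded))
```

Because the digest depends only on the contents, the same value can also serve as a lookup key for a file stored under its hash.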


Competition Workflow. The workflow of the competition is described in 
the report for Test-Comp 2021 [19]. 


CoVeriTeam. The competition was supported for the first time by
COVERITEAM [26] (https://gitlab.com/sosy-lab/software/coveriteam/), which is a
tool for cooperative verification. Among its many capabilities, it enables remote
execution of verification runs directly on the competition machines, which was
found to be a valuable service for troubleshooting.


4 Results and Discussion 


The results of the competition experiments represent the state of the art in fully 
automatic software-verification tools. The report shows the results, in terms of 
effectiveness (number of verification tasks that can be solved and correctness of 
the results, as accumulated in the score) and efficiency (resource consumption 
in terms of CPU time and CPU energy). The results are presented in the same
way as in previous years, such that the improvements compared to last year are easy
to identify. The results presented in this report were inspected and approved by
the participating teams. We now discuss the highlights of the results.


Fig. 3: Benchmarking components of SV-COMP and competition’s execution flow
(same as for SV-COMP 2020)
Panels: (a) Verification Task, (b) Benchmark Definition, (c) Tool-Info Module,
(d) Verifier Archive, (e) Verification Run, (f) Violation Witness / Correctness Witness


Table 3: Publicly available components for reproducing SV-COMP 2021

Component              Fig. 3  Repository                                 Version

Verification Tasks     (a)     github.com/sosy-lab/sv-benchmarks          svcomp21
Benchmark Definitions  (b)     gitlab.com/sosy-lab/sv-comp/bench-defs     svcomp21
Tool-Info Modules      (c)     github.com/sosy-lab/benchexec              3.6
Verifier Archives      (d)     gitlab.com/sosy-lab/sv-comp/archives-2021  svcomp21
Benchmarking           (e)     github.com/sosy-lab/benchexec              3.6
Witness Format         (f)     github.com/sosy-lab/sv-witnesses           svcomp21


Table 4: Artifacts published for SV-COMP 2021

Content                 DOI                     Reference

Verification Tasks      10.5281/zenodo.4459126  [20]
Competition Results     10.5281/zenodo.4458215  [18]
Verification Witnesses  10.5281/zenodo.4459196  [21]
BenchExec               10.5281/zenodo.4317433  [82]


Participating Verifiers. Table 5 provides an overview of the participating
verification systems (see also the listing on the competition web site at
https://sv-comp.sosy-lab.org/2021/systems.php). Table 6 lists the algorithms and
techniques that are used by the verification tools.


Automatic Participation. To ensure that the comparative evaluation continues 
to give an overview of the state of the art that is as broad as possible, a rule was 
introduced before SV-COMP 2020 which enables the option for the organizer to 
reuse systems that participated in previous years for the comparative evaluation. 
This option was used three times in SV-COMP 2021: for COASTAL, PREDATORHP, 
and SPF. Those participations are marked as ‘hors concours’ in Table 5.


Table 5: Competition candidates with tool references and representing jury members

Participant   Ref.     Jury member           Affiliation

2LS           [32,63]  Viktor Malik          BUT, Brno, Czechia
BRICK                  Lei Bu                Nanjing U., China
CBMC          [60]     Michael Tautschnig    Queen Mary U. of London, UK
COASTAL       [79]     (hors concours)       –
CPA-BAM-BnB   [4,81]   Vadim Mutilin         ISP RAS, Russia
CPALOCKATOR   [5,6]    Pavel Andrianov       ISP RAS, Russia
CPACHECKER    [27,41]  Stephan Holzner       LMU Munich, Germany
DARTAGNAN     [48,68]  Hernán Ponce de León  U. Bundeswehr Munich, Germany
DIVINE        [8,61]   Henrich Lauko         Masaryk U., Brno, Czechia
ESBMC-INCR    [36,39]  Felipe R. Monteiro    Amazon Web Services, USA
ESBMC-KIND    [46,47]  Lucas Cordeiro        U. of Manchester, UK
FRAMA-C       [40]     Martin Spiessl        LMU Munich, Germany
GAZER-THETA   [1,74]   Akos Hajdu            BME, Hungary
GOBLINT       [73,80]  Simmo Saan            U. of Tartu, Estonia
JAVA RANGER   [76,77]  Soha Hussein          U. of Minnesota, USA
JAYHORN       [59,75]  Hossein Hojjat        U. of Tehran, Iran
JBMC          [37,38]  Peter Schrammel       U. of Sussex / Diffblue, UK
JDART         [62,64]  Falk Howar            TU Dortmund, Germany
KORN          [45]     Gidon Ernst           LMU Munich, Germany
LAZY-CSEQ     [57,58]  Omar Inverso          Gran Sasso Science Institute, Italy
PESCO         [71,72]  Cedric Richter        Paderborn U., Germany
PINAKA        [35]     Saurabh Joshi         IIT Hyderabad, India
PREDATORHP    [54,67]  (hors concours)       –
SMACK         [51,70]  Zvonimir Rakamaric    U. of Utah, USA
SPF           [65,69]  (hors concours)       –
SYMBIOTIC     [33,34]  Marek Chalupa         Masaryk U., Brno, Czechia
UAUTOMIZER    [52,53]  Matthias Heizmann     U. of Freiburg, Germany
UKOJAK        [44,66]  Dominik Klumpp        U. of Freiburg, Germany
UTAIPAN       [43,49]  Daniel Dietsch        U. of Freiburg, Germany
VERIABS       [2,42]   Priyanka Darke        Tata Consultancy Services, India


Computing Resources. The resource limits were the same as in the previous
competitions [14]: Each verification run was limited to 8 processing units (cores),
15 GB of memory, and 15 min of CPU time. Witness validation was limited
to 2 processing units, 7 GB of memory, and 1.5 min of CPU time for violation
witnesses and 15 min of CPU time for correctness witnesses. The machines
for running the experiments are part of a compute cluster that consists of
168 machines; each verification run was executed on an otherwise completely
unloaded, dedicated machine, in order to achieve precise measurements. Each
machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each,
a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system
(x86_64-linux, Ubuntu 20.04 with Linux kernel 5.4). We used BENCHEXEC [28]
to measure and control computing resources (CPU time, memory, CPU energy)
and VERIFIERCLOUD (https://vcloud.sosy-lab.org) to distribute, install, run, and
clean up verification runs, and to collect the results. The values for time and
energy are accumulated over all cores of the CPU. To measure the CPU energy,
we used CPU ENERGY METER [30] (integrated in BENCHEXEC [28]).


Table 6: Algorithms and techniques that the competition candidates used

One complete verification execution of the competition consisted of 
163 177 verification runs (each verifier on each verification task of the selected 
categories according to the opt-outs), consuming 470 days of CPU time and 
126 kWh of CPU energy (without validation). Witness-based result validation 
required 961 919 validation runs (each validator on each verification task for cate- 
gories with witness validation, and for each verifier), consuming 274 days of CPU 
time. Each tool was executed several times, in order to make sure no installation 
issues occur during the execution. Including preruns, the infrastructure managed 
a total of 1.33 million verification runs consuming 4.16 years of CPU time, and 
7.31 million validation runs consuming 3.84 years of CPU time. 
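
To put these totals into perspective, the average cost of a single run can be derived from the figures above. The following sketch only performs this arithmetic; the function name is hypothetical, and the averages are not reported in the source.

```python
SECONDS_PER_DAY = 24 * 60 * 60
JOULE_PER_KWH = 3.6e6


def average_cpu_seconds(cpu_days, runs):
    """Average CPU time per run in seconds."""
    return cpu_days * SECONDS_PER_DAY / runs


# Figures of one complete verification execution, as stated in the text.
verification = average_cpu_seconds(470, 163_177)   # roughly 249 s per verification run
validation = average_cpu_seconds(274, 961_919)     # roughly 25 s per validation run
energy_per_run = 126 * JOULE_PER_KWH / 163_177     # CPU energy in joule per verification run

print(round(verification), round(validation), round(energy_per_run))
```

The averages suggest that a typical verification run stays well below the limit of 15 min of CPU time, and that a validation run is on average about ten times cheaper than a verification run.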


Quantitative Results. Table 7 presents the quantitative overview of all tools 
and all categories. The head row mentions the category, the maximal score 
for the category, and the number of verification tasks. The tools are listed in 
alphabetical order; every table row lists the scores of one verifier. We indicate 
the top three candidates by formatting their scores in bold face and in larger 
font size. An empty table cell means that the verifier opted out from the respective
main category (perhaps participating in subcategories only, restricting the
evaluation to a specific topic). More information (including interactive tables, 
quantile plots for every category, and also the raw data in XML format) is 
available on the competition web site (https://sv-comp.sosy-lab.org/2021/results) 
and in the results artifact (see Table 4). 

Table 8 reports the top three verifiers for each category. The run time (column 
‘CPU Time’) and energy (column ‘CPU Energy’) refer to successfully solved 
verification tasks (column ‘Solved Tasks’). We also report the number of tasks for 
which no witness validator was able to confirm the result (column ‘Unconf. Tasks’). 
The columns ‘False Alarms’ and ‘Wrong Proofs’ report the number of verification 
tasks for which the verifier reported wrong results, i.e., reporting a counterexample 
when the property holds (incorrect FALSE) and claiming that the program fulfills 
the property although it actually contains a bug (incorrect TRUE), respectively. 


Score-Based Quantile Functions for Quality Assessment. We use score- 
based quantile functions [11,28] because these visualizations make it eas- 
ier to understand the results of the comparative evaluation. The web site 
(https://sv-comp.sosy-lab.org/2021/results) and the results archive (see Table 4) 
include such a plot for each (sub-)category. As an example, we show the plot for cat- 
egory C-Overall (all verification tasks) in Fig. 4. A total of 10 verifiers participated 
in category C-Overall, for which the quantile plot shows the overall performance 
over all categories (scores for meta categories are normalized [11]). A more de- 
tailed discussion of score-based quantile plots, including examples of what insights 
one can obtain from the plots, is provided in previous competition reports [11, 14]. 
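
The construction of a score-based quantile function can be sketched as follows, following our reading of the earlier reports [11]: the graph of a verifier starts at the sum of its negative scores, and each correct result, taken in order of increasing run time, advances the x-coordinate by its score. The function name and the sample data are hypothetical.

```python
def quantile_points(runs):
    """Points (accumulated score, run time) of a score-based quantile function.

    runs: list of (score, cpu_time_in_seconds); negative scores are wrong
    results, zero scores are unknown or unconfirmed results.
    """
    # Wrong results shift the whole graph to the left.
    x = sum(score for score, _ in runs if score < 0)
    points = []
    correct = sorted((r for r in runs if r[0] > 0), key=lambda r: r[1])
    for score, cpu_time in correct:
        x += score
        points.append((x, cpu_time))
    return points


# Hypothetical runs: three correct results, one false alarm, one timeout.
print(quantile_points([(2, 10.0), (1, 1.0), (-16, 5.0), (0, 900.0), (2, 3.0)]))
```

In such a plot, a graph that reaches far to the right indicates a high total score, and a graph that stays low indicates short run times.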


Table 7: Quantitative overview over all results; empty cells represent opt-outs; an
asterisk after the tool name marks hors-concours participation

UTAIPAN  2743  1436  937  506  0  282  3336  7676
VERIABS  5771
COASTAL  298
JAVA RANGER  630
JAYHORN  369
JBMC  603
JDART  623


Table 8: Overview of the top-three verifiers for each category (measurement values for
CPU time and energy rounded to two significant digits)

Rank  Verifier  Score  CPU Time (in h)  CPU Energy (in kWh)  Solved Tasks  Unconf. Tasks  False Alarms  Wrong Proofs

ReachSafety
1  VERIABS  5771  130  1.5  3526  725
2  CPACHECKER  4764  100  1.2  2922  251  6
3  PESCO  4526  53  0.48  2820  272  7
MemSafety
1  SYMBIOTIC  3125  1.6  0.021  370  8
2  CPACHECKER  2992  7.8  0.069  3092  0
3  UAUTOMIZER  1615  4.1  0.046  160  2
ConcurrencySafety
1  LAZY-CSEQ  1206  4.0  0.051  985  34
2  CPACHECKER  1050  16  0.13  903  0  1
3  UAUTOMIZER  943  9.6  0.087  775  176
NoOverflows
1  CPACHECKER  531  1.2  0.012  366  3
2  UAUTOMIZER  512  [illegible]  0.015  358  0
3  UTAIPAN  506  1.9  0.018  355  0
Termination
1  UAUTOMIZER  3019  22  0.24  1581  9
2  CPACHECKER  1356  17  0.20  1078  70  10
3  2LS  1315  2.5  0.021  977  363  3
SoftwareSystems
1  SYMBIOTIC  2001  0.55  0.0075  1024  128
2  SMACK  894  14  0.14  1362  58  2
3  PESCO  878  27  0.27  1484  234  1
FalsificationOverall
1  CPACHECKER  4356  71  0.76  3814  98  8
2  PESCO  4329  47  0.41  3798  106
3  UAUTOMIZER  3432  30  0.30  1585  215  1
Overall
1  CPACHECKER  12217  190  2.1  9835  514  18
2  PESCO  12208  120  1.2  9743  579  19
3  UAUTOMIZER  11769  99  1.0  5980  489  1  1
JavaOverall
1  JAVA RANGER  630  4.9  0.056  427  0
2  JDART  623  0.93  0.0093  437  0
3  JBMC  603  0.22  0.0022  423  0


Fig. 4: Quantile functions for category C-Overall. Each quantile function illustrates
the quantile (x-coordinate) of the scores obtained by correct verification runs
below a certain run time (y-coordinate). More details were given previously [11].
A logarithmic scale is used for the time range from 1 s to 1000 s, and a linear
scale is used for the time range between 0 s and 1 s.


Table 9: Alternative rankings for category Overall; quality is given in score
points (sp), CPU time in hours (h), CPU energy in kilowatt-hours (kWh), wrong
results in errors (E), rank measures in errors per score point (E/sp), joule per
score point (J/sp), and score points (sp)

Rank  Verifier    Quality  CPU Time  CPU Energy  Solved Tasks  Wrong Results  Rank Measure
                  (sp)     (h)       (kWh)                     (E)

Correct Verifiers                                                             (E/sp)
1     UAUTOMIZER  11769    99        1.0         5980          2              0.00017
2     UKOJAK      4332     46        0.48        2476          1              0.00023
3     CPACHECKER  12217    190       2.1         9835          18             0.0015
worst  48  0.023
Green Verifiers                                                               (J/sp)
1     SYMBIOTIC   9268     21        0.26        4999          16             100
2     2LS         6219     26        0.24        3372          12             140
3     CBMC        5289     26        0.31        5596          52             210
worst  630


Alternative Rankings. The community suggested to report a couple of alternative
rankings that honor different aspects of the verification process as a
complement to the official SV-COMP ranking. Table 9 is similar to Table 8, but
contains the alternative ranking categories Correct and Green Verifiers. Column
‘Quality’ gives the score in score points, column ‘CPU Time’ the CPU usage of
successful runs in hours, column ‘CPU Energy’ the CPU usage of successful runs
in kWh, column ‘Solved Tasks’ the number of correct results, column ‘Wrong
Results’ the sum of false alarms and wrong proofs in number of errors, and column
‘Rank Measure’ gives the measure to determine the alternative rank.


Table 10: New verifiers in SV-COMP 2020 and SV-COMP 2021

Verifier     Language  First Year  Sub-categories

FRAMA-C      C         2021        4
GAZER-THETA  C         2021        9
GOBLINT      C         2021        25
KORN         C         2021        13
BRICK        C         2020        1
DARTAGNAN    C         2020        5
GACAL        C         2020        1
COASTAL      Java      2020        1
JAVA RANGER  Java      2020        1
JDART        Java      2020        1


Table 11: Confirmation rate of verification witnesses in SV-COMP 2021

Result      TRUE                              FALSE
Verifier    Total  Confirmed        Unconf.   Total  Confirmed        Unconf.

2LS         2252   2245   99.7 %    7         1591   1127   70.8 %    464
CBMC        3875   3498   90.3 %    377       3772   2098   55.6 %    1674
CPACHECKER  5992   5646   94.2 %    346       4357   4189   96.1 %    168
DIVINE      1673   1649   98.6 %    24        1317   986    74.9 %    331
ESBMC-KIND  4954   4901   98.9 %    53        1736   1625   93.6 %    111
PESCO       5973   5570   93.3 %    403       4349   4173   96.0 %    176
SYMBIOTIC   3351   3149   94.0 %    202       2166   1850   85.4 %    316
UAUTOMIZER  4121   3856   93.6 %    265       2348   2124   90.5 %    224
UKOJAK      1816   1796   98.9 %    20        690    680    98.6 %    10
UTAIPAN     2602   2542   97.7 %    60        1637   1417   86.6 %    220


Correct Verifiers: Low Failure Rate. The right-most columns of Table 8 report
that the verifiers achieve a high degree of correctness (all top three verifiers
in the C-Overall have less than 2 ‰ wrong results). The winners of category
Java-Overall produced not a single wrong answer. The first category in
Table 9 uses a failure rate as rank measure:
$\frac{\text{number of incorrect results}}{\text{total score}}$, the number of
errors per score point (E/sp). We use E as unit for number of incorrect results
and sp as unit for total score. The worst result was 0.032 E/sp in SV-COMP 2020
and is now improved to 0.023 E/sp.
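
The failure rate can be recomputed from the values in Table 9. The sketch below is only this division; the function name is hypothetical, and the inputs are the wrong results and score points listed for the category Correct Verifiers.

```python
def failure_rate(wrong_results, score_points):
    """Rank measure of the Correct Verifiers ranking, in errors per score point (E/sp)."""
    return wrong_results / score_points


# (wrong results in E, quality in sp) as listed in Table 9
table9 = {"UAUTOMIZER": (2, 11769), "UKOJAK": (1, 4332), "CPACHECKER": (18, 12217)}
for name, (errors, points) in table9.items():
    print(name, f"{failure_rate(errors, points):.2g}")
```

Rounded to two significant digits, the quotients agree with the rank measures printed in the table.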


Green Verifiers: Low Energy Consumption. Since a large part of the cost of
verification is given by the energy consumption, it might be important to also
consider the energy efficiency. The second category in Table 9 uses the energy
consumption per score point as rank measure:
$\frac{\text{total CPU energy}}{\text{total score}}$, with the unit J/sp.
The worst result from SV-COMP 2020 was 2200 J/sp, now improved to 630 J/sp.
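
The energy-based rank measure can be recomputed from Table 9 in the same way, after converting kWh to joule. The function name is hypothetical; because the energy values in the table are already rounded to two significant digits, the recomputed values can deviate slightly from the printed ones.

```python
JOULE_PER_KWH = 3.6e6


def energy_per_score_point(cpu_energy_kwh, score_points):
    """Rank measure of the Green Verifiers ranking, in joule per score point (J/sp)."""
    return cpu_energy_kwh * JOULE_PER_KWH / score_points


# (CPU energy in kWh, quality in sp) as listed in Table 9
table9 = {"SYMBIOTIC": (0.26, 9268), "2LS": (0.24, 6219), "CBMC": (0.31, 5289)}
for name, (energy, points) in table9.items():
    print(name, round(energy_per_score_point(energy, points)))
```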


New Verifiers. To acknowledge the verification systems that participate for
the first or second time in SV-COMP, Table 10 lists the new verifiers (in
SV-COMP 2020 or SV-COMP 2021).


Verifiable Witnesses. Results validation is of primary importance in the compe- 
tition. All SV-COMP verifiers are required to justify the result (TRUE or FALSE) 
by producing a verification witness (except for those categories for which no wit- 
ness validator is available). We used six independently developed witness-based 
result validators and one witness linter (see Table 1).


Fig. 5: Number of evaluated verifiers for each year (first-time participants on top)


Table 11 shows the confirmed versus unconfirmed results: the first column
lists the verifiers of category C-Overall, the three columns for result TRUE report
the total, confirmed, and unconfirmed number of verification tasks for which the
verifier answered with TRUE, respectively, and the three columns for result FALSE
report the total, confirmed, and unconfirmed number of verification tasks for
which the verifier answered with FALSE, respectively. More information (for all
verifiers) is given in the detailed tables on the competition web site and in the
results artifact; all verification witnesses are also contained in the witnesses
artifact (see Table 4). The verifiers 2LS and UKOJAK are the winners in terms
of confirmed results for expected results TRUE and FALSE, respectively. The
overall interpretation is similar to SV-COMP 2020 [17].
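
The confirmation rates and the two winners can be recomputed from the totals in Table 11. The sketch below copies the (total, confirmed) pairs from the table; the variable and function names are hypothetical.

```python
# verifier: ((total TRUE, confirmed TRUE), (total FALSE, confirmed FALSE)), from Table 11
TABLE_11 = {
    "2LS": ((2252, 2245), (1591, 1127)),
    "CBMC": ((3875, 3498), (3772, 2098)),
    "CPACHECKER": ((5992, 5646), (4357, 4189)),
    "DIVINE": ((1673, 1649), (1317, 986)),
    "ESBMC-KIND": ((4954, 4901), (1736, 1625)),
    "PESCO": ((5973, 5570), (4349, 4173)),
    "SYMBIOTIC": ((3351, 3149), (2166, 1850)),
    "UAUTOMIZER": ((4121, 3856), (2348, 2124)),
    "UKOJAK": ((1816, 1796), (690, 680)),
    "UTAIPAN": ((2602, 2542), (1637, 1417)),
}


def confirmation_rate(total, confirmed):
    """Share of results whose witness was confirmed, in percent."""
    return 100.0 * confirmed / total


def best_verifier(column):
    """Verifier with the highest confirmation rate; column 0 is TRUE, 1 is FALSE."""
    return max(TABLE_11, key=lambda name: confirmation_rate(*TABLE_11[name][column]))


print(best_verifier(0), best_verifier(1))
```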


5 Conclusion 


The 10th edition of the Competition on Software Verification (SV-COMP 2021) 
had 30 participating verification systems from 11 countries (see Fig. 5 for the 
participation numbers and Table 5 for the details). The competition does not only 
execute the verifiers and collect results, but also validates the verification results 
using verification witnesses. We used six independent validators to check the 
results and a witness linter to check if the verification witnesses are syntactically 
valid (Table 1). The number of verification tasks was increased to 15 201 in the 
C category and to 473 in the Java category. The high quality standards of the 
TACAS conference, in particular with respect to the important principles of 
fairness, community support, and transparency are ensured by a competition jury 
in which each participating team had a member. The results of our comparative 
evaluation provide a broad overview of the state of the art in automatic software 
verification. SV-COMP is instrumental in developing more reliable tools, as well 
as identifying and propagating successful techniques for software verification. 


Data Availability Statement. The verification tasks and results of the 
competition are published at Zenodo, as described in Table 4. All compo- 
nents and data that are necessary for reproducing the competition are avail- 
able in public version repositories, as specified in Table 3. Furthermore, the 
results are presented online on the competition web site for easy access: 
https://sv-comp.sosy-lab.org/2021/results/.


Abstract. Our submission to SV-COMP’21 is based on the software
verification framework CPACHECKER and implements an extension of the
thread-modular approach. It considers every thread separately, but in
a special environment which models thread interactions. The environment
is expressed by projections of normal transitions in each thread.
A projection contains a description of possible effects over shared data
and synchronization primitives, as well as conditions of its application.
By adjusting the precision of the projections, one can find a balance between
the speed and the precision of the whole analysis.

Implementation on top of the CPACHECKER framework allows combining
our approach with existing algorithms and analyses. Evaluation on the
sv-benchmarks confirms the scalability and soundness of the approach.


Keywords: Multithreading · Projection · Thread-modular approach


1 Verification Approach 


The main challenge for verification of industrial multithreaded software is to
consider potential thread interactions efficiently. Our verification approach is
based on the thread-modular technique [4,5]. The approach avoids a
Cartesian product of thread states by considering each thread state separately.
Thus, an abstract state is no longer a complete one: it is a partial abstract
state that represents only one thread. However, because of this, the analysis has
no information about transitions in other threads, which is required for
the soundness of the analysis. Thus, to preserve soundness, we have to take into
account the influence of the other threads on the considered thread. For that
purpose, we compute a special representation of the environment, which consists
of a set of thread transitions, so-called projected transitions, or projections.
The projections may be more or less precise, which strongly affects the precision
and speed of the whole analysis. Note that the projections are independent of each
other, and thus the correct order in which they occur is lost. Potentially, every
projection may affect another thread at any time. This is an overapproximation,
which leads to an imprecise analysis.

Let us explain how we increase the precision by considering only compatible
projections.


Fig. 1. Computation of a thread environment and its application


Figure 1 shows one step of the analysis. After computing an abstract
state in the first thread, we have to propagate the effect of the operation to the
other threads (x is a shared variable). Thus, we compute a projection of the
operation. The projection becomes a part of the environment and affects the other
threads through it. Then we apply the new effect to the other threads.

In the example, we lose the precision of the effect by abstracting from the
assigned value (x = *). One of the key ideas of the proposed approach is to
extend abstraction not only to states but also to operations, i.e., transitions.
Thus, in other configurations the projection may look like x = 1 or * = *. This
allows adjusting the level of abstraction of the environment for a specific task.
By adjusting the configuration, it is possible not only to vary the abstraction
level but also to construct an algorithm that is closer either to data-flow
analysis or to software model checking.

To make a precise analysis possible, we suggest encoding not only abstract
operations but also conditions for their application, so-called guards. Guards
are related to the predecessor abstract state, but they are not required to be
equal to it. They store information about variable values, locks, threads, or
even abstract predicates. In Figure 1, the guard records the initial value of the
modified variable x (x == 0). A projection may be applied to a particular state
only if its guards allow it. In that case we say that the projection is
compatible with the abstract state of the other thread. In our example, the
effect x = * may be applied to the other thread only if the corresponding state
does not contradict the condition x == 0.
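
The compatibility check described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the tool's implementation:
the names Projection, is_compatible and apply_projection are hypothetical, and
abstract states are modelled as dictionaries that map a variable to a known
constant or to None when the value is unknown.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Abstract state: variable -> known constant, or None when unknown ("*").
State = Dict[str, Optional[int]]


@dataclass(frozen=True)
class Projection:
    """Hypothetical projection: a guard plus an abstracted assignment."""
    guard: Dict[str, int]        # {"x": 0} encodes the guard x == 0
    target: str                  # variable written by the operation
    value: Optional[int] = None  # None encodes the abstracted effect x = *


def is_compatible(proj: Projection, state: State) -> bool:
    """A state contradicts the guard only if it knows a different value."""
    for var, expected in proj.guard.items():
        known = state.get(var)
        if known is not None and known != expected:
            return False
    return True


def apply_projection(proj: Projection, state: State) -> State:
    """Apply the effect to a compatible state; leave other states unchanged."""
    if not is_compatible(proj, state):
        return state
    new_state = dict(state)
    new_state[proj.target] = proj.value
    return new_state


effect = Projection(guard={"x": 0}, target="x")  # guard x == 0, effect x = *
after_compatible = apply_projection(effect, {"x": 0, "y": 5})
after_contradicting = apply_projection(effect, {"x": 3, "y": 5})
print(after_compatible)      # x becomes unknown
print(after_contradicting)   # state contradicts x == 0 and is left as is
```

In this sketch a state that knows nothing about x is also treated as
compatible, since it does not contradict the guard.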

More information about the approach and theoretical preliminaries can be 
found in [1]. Practical application of the theory to the Linux kernel drivers can 
be found in [2]. 


2 Software Architecture 


CPALockator is based on the CPACHECKER framework and has the same software
architecture. Its key concept is the configurable program analysis (CPA) [3].
Each abstract domain is implemented in its own CPA. CPAs in the framework, e.g.
value analysis or predicate analysis, can be combined to build a more efficient
and more precise approach. A configurable algorithm, CEGAR in the case of
CPALockator, uses the CPAs to construct a set of reachable states. Figure 2
presents the current configuration. The highlighted components are implemented
and used only in CPALockator. Lock analysis tracks acquired locks, which helps to
compute the thread effects that can be applied to a particular thread. Thread
analysis determines whether two code blocks may be executed in parallel.
Predicate analysis is extended to handle environment actions, which allows a
predicate abstraction to be constructed in the thread-modular case. More
information about CPALockator may be found in [1,2].


Fig. 2. Different CPAs in CPALockator configuration 


3 Strengths and Weaknesses 


First, we need to emphasize that the tool is targeted at, and used in practice
for, finding bugs in large industrial software systems, for example operating
system kernels. We applied the tool to the Linux kernel and to a number of
private kernels of real-time operating systems. The main challenge there is
scalability. Results on small but tricky sv-benchmarks tasks look poor because of
the trade-off between scalability and precision. Our tool is not as precise as
the other participants, but it demonstrates its scalability on a small set of
complicated sv-benchmarks tasks. Nevertheless, such a comparison is useful for
the community.

The thread-modular approach cannot solve tasks that contain control
dependencies in the environment, because we consider all projections
independently of each other and thus lose their order. This is also a problem
for witness validation, as the tool provides a path in a single thread only. It
is a limitation of the approach, not just of the tool. In practice, we use a more
user-friendly format than witnesses to analyze, visualize and evaluate error
traces [6]. However, the approach simplifies thread interaction, and the benefit
is considerable for large, complicated tasks that cannot be analyzed with precise
model checkers.

The approach shows its benefit on complicated tasks such as those in the
ldv-linux-3.14-races directory. CPALockator correctly solves 4 of those 7
benchmarks and obtains an imprecise counterexample for one more. The remaining
two tasks can be solved by another, faster CPALockator configuration. The other
tools mostly have problems with these benchmarks due to their complexity and
size. The explanation is rather evident: most of the tools try to consider the
precise interaction between threads, while CPALockator abstracts from it and
considers each thread separately. Note that the benchmarks give verifiers a
strong hint: there is only one assert to check, while in the real world nobody
knows where the bug may be located.
The overall results are not as good because of problems related both to the
approach itself and to its implementation. The majority of unknowns are related
to unsupported atomic operations, like _atomic_ functions, compare_and_swap and
so on. Currently, our tool supports only lock-based synchronization operations,
as industrial software mostly relies on them. Another problem is related to
predicate analysis and interpolation. The current implementation of the
interpolation procedure cannot produce interpolants for other threads, which
limits the power of predicate analysis. Other problems are also present, but they
are less significant.

CPALockator does not produce incorrect true verdicts, which confirms the
soundness of the approach. All produced true verdicts are confirmed by
validators. However, there are not many of them, as we skip all tasks with
unsupported functions. Thus, the presented approach may be used in combination
with more precise techniques.


4 Tool Setup and Configuration 


We submitted CPALockator built from svn revision 36155 for participation
in the category Concurrency. The tool requires a Java 11 runtime environment.
CPACHECKER has to be executed with the following command line:


scripts/cpa.sh -svcomp21-lockator -spec reach.prp program.i 


or via the BenchExec tool.


5 Project and Contributors 


The CPACHECKER project is mainly developed by an international research group 
from the Ludwig-Maximilian University of Munich. CPALOCKATOR is based on 
CPACHECKER and is developed and supported by researchers from Ivannikov 
Institute for System Programming of the Russian Academy of Sciences. We thank 
Dirk Beyer and the CPACHECKER team for their work and fruitful discussions. 
Abstract. We describe the new features of the bounded model checker
DARTAGNAN for SV-COMP’21. We participate, for the first time, in 
the ReachSafety category on the verification of sequential programs. In 
some of these verification tasks, bugs only show up after many loop iter- 
ations, which is a challenge for bounded model checking. We address the 
challenge by simplifying the structure of the input program while pre- 
serving its semantics. For simplification, we leverage common compiler 
optimizations, which we get for free by using LLVM. Yet, there is a price 
to pay. Compiler optimizations may introduce bitwise operations, which 
require bit-precise reasoning. We evaluated an SMT encoding based on 
the theory of integers + bit conversions against one based on the the- 
ory of bit-vectors and found that the latter yields better performance. 
Compared to the unoptimized version of DARTAGNAN, the combination 
of compiler optimizations and bit-vectors yields a speed-up of an order 
of magnitude on average. 


1 Overview 


DARTAGNAN is a bounded model checking (BMC) tool for reachability analysis. 
It takes a program and converts it to an SMT formula representing all its execu- 
tions up to a given bound. This formula, together with a reachability condition 
representing assertions, is passed to an SMT solver (we use Z3 as a backend). If 
the formula is satisfiable, an execution violating an assertion exists. 
DARTAGNAN was initially developed to verify small concurrent programs
(written in the .litmus format) under weak memory models. Since 2020, it also
supports the Boogie intermediate verification language as an input language. For
C programs, we use SMACK [8] to compile to LLVM and transform the compiled code
to Boogie. DARTAGNAN’s architecture and main verification techniques (in
particular how to efficiently handle different memory models) are described in
[3,4,7]. Version 2.0.7, which participates in SV-COMP’21 [1], can be downloaded
from https://github.com/hernanponcedeleon/Dat3M directly as a Java archive
(.jar) or built from source code using the Maven build system. DARTAGNAN’s
verifier archive to reproduce the results of SV-COMP’21 is published at Zenodo
under DOI 10.5281/zenodo.4483224.


int main(void) {
    unsigned int x = 1;
    unsigned int y = 0;

    while (y < 1024) {
        x = 0;
        y++;
    }

    __VERIFIER_assert(x == 0);
}


Fig. 1. Benchmark const_1-1.c from the ReachSafety-Loop category.

Last year, DARTAGNAN participated only in the ConcurrencySafety category.
What is new for SV-COMP’21 is that DARTAGNAN also participates in (part of)
the ReachSafety category for single-threaded programs. Many tasks in that
category contain loops with large bounds, which hurts DARTAGNAN’s performance.
To address the problem, we propose to leverage compiler optimizations.


2 Leveraging Compiler Optimizations 


BMC techniques are very sensitive to the program syntax. The loop structure 
and the number of variables directly impact the size of the SMT formula (which 
tends to relate to solving times). Our approach is to simplify the structure of 
the program (while preserving its semantics) before performing the verification. 
We do this by using compiler optimizations. 

Consider the program in Fig. 1 from the ReachSafety-Loop category. A BMC 
tool has to unroll the program 1024 times to prove the program correct. However, 
since the value of x is constant at every loop iteration, the assignment can be 
moved outside the loop. Since the value of y is never read, the instruction y++ 
can be removed (using dead store elimination) leading to an empty loop which 
can also be removed. Finally, using constant propagation, the assertion can be 
re-written as __VERIFIER_assert(0 == 0) which is trivially true.
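
The effect of these simplifications can be illustrated with a small model of
the benchmark. The sketch below is only an illustration in Python, not the
LLVM pipeline itself; the function names are hypothetical. It contrasts the
work a bounded model checker has to mirror by unrolling with what remains after
the optimizations described above.

```python
def original_program():
    """Direct model of the benchmark; returns (assertion holds, iterations)."""
    x, y = 1, 0
    iterations = 0
    while y < 1024:
        x = 0
        y += 1
        iterations += 1
    return x == 0, iterations


def simplified_program():
    """Model after hoisting x = 0, dropping the dead y++ and the empty loop."""
    iterations = 0
    return 0 == 0, iterations  # constant propagation leaves a trivial check


holds_before, unrollings_before = original_program()
holds_after, unrollings_after = simplified_program()
print(holds_before, unrollings_before)
print(holds_after, unrollings_after)
```

Both models agree on the verdict, but only the first one needs 1024 loop
iterations to reach it.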

All these optimizations are implemented in most optimizing compilers. Since 
we perform the verification after compiling to LLVM, we get them for free. Due 
to the high number of loop iterations, DARTAGNAN needs more than 15 minutes 
to verify the program above. However, by using the -O3 optimization flag in the
C-to-Boogie transformation, the verification task can be solved within seconds. 

Using an optimizing compiler has its risks. Most optimizations are unsound 
for concurrent programs [9] and we do not use any for ConcurrencySafety. Even 
for sequential programs, there is a price to pay. Some optimizations introduce 
bitwise operations (e.g. multiplications tend to be compiled to shift operations) 
which were not present in the original program. We thus have to encode the
semantics of such operations precisely. 


3 The Price of Precision 


To guarantee soundness when using the aforementioned compiler optimizations 
in the ReachSafety category, we use two precise encodings of integers. The first 
is a new implementation based on the theory of bit-vectors, where we get bit- 
precise reasoning for free. The second was our original implementation and it 
is based on the theory of integers. It does an on-demand conversion to bit- 
vectors and back (Int2Bv and Bv2Int). We are able to solve more benchmarks 
with the theory of bit-vectors than with the theory of integers plus conversion, 
which suggests that converting between the theories is expensive. For concurrent 
programs, the combination of bit-vectors with DARTAGNAN’s memory-model- 
dependent encoding significantly degrades performance, and we use the theory 
of integers throughout the ConcurrencySafety category. 
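
To see why bit-precise reasoning matters once multiplications are compiled to
shifts, consider the following sketch. It assumes 32-bit unsigned machine words
and uses hypothetical helper names; it is not DARTAGNAN's encoding, only an
illustration of where an encoding over unbounded integers diverges from the
machine semantics.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit unsigned machine words


def mul8_bv(x: int) -> int:
    """x * 8 with 32-bit wrap-around, as in the source program."""
    return (x * 8) & MASK32


def shl3_bv(x: int) -> int:
    """x << 3 with 32-bit wrap-around, as an optimizer might emit it."""
    return (x << 3) & MASK32


def shl3_unbounded(x: int) -> int:
    """The same shift over mathematical integers, ignoring the word size."""
    return x << 3


samples = [0, 1, 12345, 0x1FFFFFFF, 0x20000000, 0xFFFFFFFF]

# With wrap-around, the shift is a faithful replacement for the multiplication.
assert all(mul8_bv(x) == shl3_bv(x) for x in samples)

# Over unbounded integers the result differs as soon as the shift overflows.
overflowing = [x for x in samples if shl3_unbounded(x) != shl3_bv(x)]
print([hex(x) for x in overflowing])
```

An integer-based encoding therefore has to model the wrap-around explicitly,
which is the role of the conversions to bit-vectors and back.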

The trade-off between the efficiency of a theory and the precision in modeling 
semantics is well-known. In the context of symbolic execution, it was explored 
in [6]. SMACK implements an approach to diagnose spurious counterexamples 
caused by over-approximations and gradually refines the precision of reasoning 
about bitwise operations [5]. 


4 Evaluation 


We evaluated how compiler optimizations and different integer encodings affect 
DARTAGNAN’s verification capabilities for some benchmarks in the ReachSafety 
category. We support two levels of optimization: -O0 (no optimization) and -O3
(enables most optimizations). For integer encodings we use two different ap- 
proaches: theory of integers + bit conversions (QF_LIA + QF_BV logics) and pure 
theory of bit-vectors (QF_BV logic). 

The results are given in Fig. 2. We use BENCHEXEC [2] for reliable benchmark- 
ing. The graph shows the verification time w.r.t the verification score. Following 
the competition scheme, correct counter-examples and proofs give +1 and +2 
points respectively. Wrong counter-examples and proofs give -16 and -32 points. 
The absolute score values for incorrect results are higher because a single correct 
answer should not compensate for a wrong answer. 
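
The scoring scheme above amounts to a weighted sum. The following sketch
computes it for given result counts; the function name and the example counts
are hypothetical and do not correspond to any reported configuration.

```python
def verification_score(correct_cex: int, correct_proofs: int,
                       wrong_cex: int, wrong_proofs: int) -> int:
    """Score as described in the text: +1, +2, -16 and -32 points."""
    return (1 * correct_cex + 2 * correct_proofs
            - 16 * wrong_cex - 32 * wrong_proofs)


# Illustrative counts only: ten correct counter-examples and five correct
# proofs are almost cancelled out by a single wrong counter-example.
example = verification_score(correct_cex=10, correct_proofs=5,
                             wrong_cex=1, wrong_proofs=0)
print(example)
```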

It can be seen that, regardless of the chosen integer encoding, using compiler 
optimizations allows us to verify many more benchmarks, thus obtaining a higher 
score. The total number of solved tasks with no optimizations (O0+Bit-vectors
and O0+Int-exact configurations from Fig. 2) is 89 with 77 correct and 12
incorrect results. When using optimizations (O3+Bit-vectors and O3+Int-exact
configurations), we solved 336 tasks with 326 correct and 10 incorrect results.

The experiments show that combining theories to achieve precision is more
expensive than using pure bit-vectors. The total number of solved tasks when
using QF_LIA + QF_BV (configurations O0+Int-exact and O3+Int-exact) is 201
with 187 correct and 14 incorrect results. When using QF_BV (configurations
O0+Bit-vectors and O3+Bit-vectors) we solved 224 tasks with 216 correct
and 8 incorrect results. All encodings are guaranteed to be sound; the incorrect
results are due to bugs in the verifier.


[Plot omitted; legend: Dartagnan O0 + Bit-vectors and O3 + Bit-vectors (run 2020-12-08 21:31:49 UTC), O0 + Int-exact and O3 + Int-exact (run 2020-12-09 12:22:51 UTC)]

Fig. 2. Comparing the performance of DARTAGNAN with different optimization flags
and integer encodings.
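
The counts reported above can be cross-checked against each other. Assuming
that every result belongs to exactly one of the four configurations, grouping
by optimization level and grouping by integer encoding must yield the same
totals. The sketch below (variable and function names are illustrative) performs
this check on the numbers from the text.

```python
# (correct, incorrect, solved) as reported in the text for each grouping.
by_optimization = {"O0": (77, 12, 89), "O3": (326, 10, 336)}
by_encoding = {"QF_LIA + QF_BV": (187, 14, 201), "QF_BV": (216, 8, 224)}


def totals(groups):
    """Sum the correct, incorrect and solved counts over one grouping."""
    return tuple(sum(column) for column in zip(*groups.values()))


# Each group is internally consistent: correct + incorrect == solved.
for groups in (by_optimization, by_encoding):
    for name, (correct, incorrect, solved) in groups.items():
        assert correct + incorrect == solved, name

print(totals(by_optimization))
print(totals(by_encoding))
```

Both groupings add up to the same overall numbers of correct, incorrect and
solved tasks.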


We used the evaluation described above to decide the configuration for SV- 
COMP’21. For category ConcurrencySafety, we use the integer encoding and no 
compiler optimizations. For categories ReachSafety-Loop, ReachSafety-BitVectors
and ReachSafety-Arrays, DARTAGNAN uses the theory of bit-vectors and -O3
optimizations. These configurations are internally decided by the tool based on the
use of the pthreads library. Compared with SV-COMP’20, we solved 60 more 
tasks in ConcurrencySafety (55% increase) and 474 more tasks overall (582% 
increase). 


Acknowledgement: We thank the SMACK developers for their constant sup- 
port with the C-to-Boogie transformation. We also thank Yun Zhang for her 
contributions to the development of the witness generation. 
Abstract. GAZER-THETA is a software model checking toolchain including
various analyses for state reachability. The frontend, namely GAZER, supports C
programs through an LLVM-based transformation and optimization pipeline. GAZER
includes an integrated bounded model checker (BMC) and can also employ the
THETA backend, a generic verification framework based on
abstraction-refinement (CEGAR). At SV-COMP 2021, a portfolio of BMC,
explicit-value analysis, and predicate abstraction is applied sequentially in
this order.


1 Verification Approach and Software Architecture 


GAZER-THETA is a software model checking toolchain with two main compo- 
nents: GAZER, an LLVM-based frontend and THETA, a generic model checking 
framework. An overview of the architecture and the verification approach can 
be seen in Figure 1. 


[Figure omitted; components: C code, compiler, LLVM passes, IR translation, BMC, CEGAR (predicate analysis, explicit analysis), Z3 solver, Witness, Harness]


Fig. 1. Overview of the architecture. Solid arrows represent the workflow, dashed ar- 
rows indicate dependency. GAZER and THETA components are denoted by lighter and 
darker backgrounds, respectively. 
Gazer. GAZER [7] is a verification frontend for C programs written in C++17,
using the LLVM compiler infrastructure. The input is a C program (possi- 
bly consisting of multiple source files) that is first translated to the LLVM IR 
(intermediate representation) using the clang compiler. Next, various built-in 
and custom LLVM passes are executed to perform optimizations (e.g., inlining, 
constant propagation, assertion lifting) and transformations (e.g., adding trace- 
ability information) on the IR. The LLVM IR is then transformed into different 
variants of control flow automata (CFA), depending on the backend to be used. 
GAZER includes a built-in variant [5,7] of bounded model checking [2], relying on 
the Z3 SMT solver [6]. The other supported backend is THETA (to be presented
below). Currently, both backends provide analysis for reachability properties. 
In the final step, the “raw” results of the backends are processed to produce 
a verdict (safe, unsafe, unknown) and a witness. Currently, GAZER only sup- 
ports violation witnesses, both in a user-friendly syntax and in the format of 
SV-COMP. Furthermore, GAZER is also capable of generating executable test 
harnesses that can be used, e.g., in a debugger to reach the property violation. 


Theta. THETA [8] is a generic and modular model checking framework written 
in Java 11, providing abstraction- and CEGAR-based analyses [4] for various 
formalisms, including CFA. THETA is highly configurable, supporting different 
abstract domains (such as explicit-value analysis [1] or predicate abstraction [3]) 
and refinement strategies, mostly based on interpolation (using SMT solvers such 
as Z3 [6]). In the explicit-value analysis, only a subset of program variables is 
tracked, while predicate abstraction keeps track of logical facts and relationships 
instead of concrete values. 


Verification portfolio. Based on our preliminary experiments, at SV-COMP 2021, 
we apply a sequential portfolio consisting of 3 steps, as illustrated by Figure 2. 
The portfolio is implemented as a Python script, which calls the tools described 
previously. First, bounded model checking is performed with a 150s time limit, 
which — in our experience — can already solve many unsafe instances. If BMC is 
inconclusive, we move on to an explicit-value analysis with a 100s limit, which 
can be effective for simpler, mostly deterministic programs. Finally, if the result 
is still unknown, we move on to the more heavyweight method of predicate ab- 
straction. If any of the phases reports an unsafe result, as an additional step, 
we generate an executable test harness from the counterexample and check if 
the program actually reaches the property violation. This allows us to filter out 
some false positives (by reporting unknown instead of unsafe). 
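
The sequential portfolio can be sketched as follows. This is not the actual
Python script of the toolchain: the function names, the stub analyses and the
verdict strings are assumptions made for illustration. The sketch also assumes
that an unconfirmed violation ends the run with "unknown", and that predicate
abstraction has no phase-specific limit (modelled as None).

```python
from typing import Callable, List, Optional, Tuple

Verdict = str  # "safe", "unsafe" or "unknown"
# A phase takes the task and a time limit in seconds (None: no phase limit).
Phase = Callable[[str, Optional[int]], Verdict]


def run_portfolio(task: str,
                  phases: List[Tuple[Phase, Optional[int]]],
                  confirm_violation: Callable[[str], bool]) -> Verdict:
    """Run the phases in order and stop at the first conclusive verdict."""
    for analysis, limit in phases:
        verdict = analysis(task, limit)
        if verdict == "unsafe":
            # Extra step: execute a test harness built from the counterexample.
            return "unsafe" if confirm_violation(task) else "unknown"
        if verdict == "safe":
            return "safe"
    return "unknown"


# Stub analyses standing in for the real tools (hypothetical behaviour).
def bmc(task, limit):
    return "unknown"


def explicit_value(task, limit):
    return "unsafe"


def predicate_abstraction(task, limit):
    return "safe"


phases = [(bmc, 150), (explicit_value, 100), (predicate_abstraction, None)]
confirmed = run_portfolio("task.c", phases, confirm_violation=lambda t: True)
filtered = run_portfolio("task.c", phases, confirm_violation=lambda t: False)
print(confirmed, filtered)
```

With these stubs, the second phase reports a violation; it is kept when the
harness reproduces it and downgraded to "unknown" otherwise.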


2 Strengths and Weaknesses 


[Figure omitted; phases: GAZER BMC, THETA expl., THETA pred., with three "Execute cex." steps; outcomes include Safe and Unknown]

Fig. 2. Overview of the portfolio approach. Distinct symbols mark safe,
inconclusive and unsafe results. Numbers indicate the time limit of each phase.


GAZER-THETA currently targets reachability analysis, so we participate in the
ReachSafety category, excluding the subcategories Arrays, Heap and Sequentialized
due to features with limited support (e.g., pointers). The strength of the tool
is its modularity and configurability, combining the advantages of different
analyses into a diverse portfolio. Out of the 3679 tasks, there are 1722
confirmed correct (1079 safe, 643 unsafe), 4 unconfirmed correct, and 13
incorrect (false positive) results. A majority of the solved tasks (86% of 1722)
come from the BMC phase; with a few exceptions, the CEGAR analyses need to be
utilized only for safe instances (though, based on our experiments, they could
also handle most of the tasks solved by BMC). The explicit-value analysis handles
a further 100 tasks in the ECA subcategory, while predicate abstraction solves
130 additional instances from Loops and ProductLines. Surprisingly, BMC can
actually solve a significant number (857) of safe instances as well, which can be
attributed to LLVM optimizations and enhancements in the algorithm [7].
Furthermore, we also observed that executable harnesses could rule out many
(142) false positives.

The weakness of GAZER-THETA is its limited support for certain features, 
such as arrays, bit-precise reasoning (only available for BMC), and pointers. We 
also observed that the LLVM IR representation often results in large CFA (e.g., 
many temporary variables due to SSA form), which makes reasoning harder 
via CEGAR (as witnessed, e.g., by the ECA subcategory). Currently, the tool 
gives empty correctness witnesses only meeting syntactical requirements, but 
surprisingly most of them were accepted. Furthermore, our violation witnesses 
are quite “sparse” due to heavy usage of optimization passes, but some validators 
can still prove their correctness. The 13 false positive results are caused by 
unsupported library functions (related to floats) treated as external calls with 
undefined (arbitrary) behavior. 


3 Tool Setup and Configuration 


The competition contribution is based on GAZER v1.2.1
(https://github.com/ftsrg/gazer/releases/tag/v1.2.1) and THETA v2.5.0.
Additionally, the BMC backend of GAZER uses Z3 version 4.8.6, while THETA
is based on Z3 version 4.5.0. The projects’ repositories contain instructions on
building the tools, but an archive with pre-built binaries for Ubuntu 18.04 or
20.04 can be found on Zenodo. The toolchain requires the packages clang-9,
libgomp1, llvm-9, openjdk-11-jre-headless and python3 to be installed. The entry
point of the toolchain is scripts/gazer_starter.py, which takes the verification
task (C program) as its only mandatory input and runs the portfolio. No other
parameters or configuration are required. Optionally, the output directory
can be set (--output) and the version can be queried (--version).


4 Software Project 


GAZER and THETA are maintained by the Critical Systems Research Group of
the Budapest University of Technology and Economics with various contributors.
The projects are available open-source on GitHub under an Apache 2.0 license.
Abstract. GOBLINT is a static analysis framework for C programs specializing
in data race analysis. It relies on thread-modular abstract interpretation where
thread interferences are accounted for by means of flow-insensitive global
invariants.


1 Verification Approach 


GOBLINT is a static analyzer for C programs based on the framework of ab- 
stract interpretation [5]. It performs flow- and context-sensitive interprocedural 
analysis, using partial tabulation to handle procedure calls. The analysis of con- 
current programs is thread-modular: analyzing each thread in isolation, as op- 
posed to analyzing their interleavings. This scales well to larger programs with 
many threads. Interferences between threads happen through global variables, 
which are abstracted by a context- and flow-insensitive global invariant. When 
no other thread can interfere, copies of global variables are privatized within 
the local state. Their values may deviate from the global invariant due to local 
updates, thereby improving precision [11]. 
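
The role of the flow-insensitive global invariant and of privatization can be
illustrated with a small sketch. It is a simplification under stated
assumptions, not GOBLINT's OCaml implementation: values are plain integer
intervals, and the names GlobalInvariant, side_effect and read_global are
hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # closed interval [lo, hi]


def join(a: Interval, b: Interval) -> Interval:
    """Least upper bound of two intervals."""
    return (min(a[0], b[0]), max(a[1], b[1]))


class GlobalInvariant:
    """One flow-insensitive abstract value per global variable."""

    def __init__(self, initial: Dict[str, Interval]):
        self.values = dict(initial)

    def side_effect(self, var: str, value: Interval) -> None:
        # Writes from any thread are accumulated, never overwritten.
        self.values[var] = join(self.values[var], value)

    def read(self, var: str) -> Interval:
        return self.values[var]


def read_global(var: str, invariant: GlobalInvariant,
                private: Dict[str, Interval]) -> Interval:
    """Use the privatized copy if there is one, else the global invariant."""
    return private.get(var, invariant.read(var))


inv = GlobalInvariant({"g": (0, 0)})
inv.side_effect("g", (5, 5))    # one thread executes g = 5
inv.side_effect("g", (-1, -1))  # another thread executes g = -1

unprotected = read_global("g", inv, private={})
privatized = read_global("g", inv, private={"g": (5, 5)})
print(unprotected, privatized)
```

The read without a private copy sees every value any thread may have written,
whereas the privatized read keeps the precise local value.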

The analysis is specified using a side-effecting constraint system [3], in which 
right-hand sides of constraints can, during their evaluation, make additional con- 
tributions (side effects) to other constraint system variables. These side effects 
can be conveniently used both to express partial context-sensitivity of function 
calls and to add contributions to the global invariant. Such a constraint system 
is solved using a local generic solver, which yields a (post-)solution for just the 
reachable program points and contexts [1,8]. Solving is not strictly separated 
into widening and narrowing phases, but these may be intertwined instead [1]. 
Results of the analysis are reported only at the end based on the computed 
solution, as widening during the fixpoint computation might lead to spurious 
property violations, which later disappear due to narrowing. 
Reachability Safety. Reachability is mainly determined using value analysis,
which, for integers, employs abstract domains based on intervals and exclusion 
sets. The value analysis also handles pointers (computing points-to information), 
heap memory (using allocation-site abstraction), structs, unions and arrays. The 
abstraction of arrays employs partitioning by the symbolic expression that is used 
to index into the array. On top of that, both global variables and heap-allocated 
memory are partitioned into disjoint regions [9]. 


No Overflows. The sound interval analysis is implemented using arbitrary pre- 
cision integers. If the interval for an expression lies completely in the value range 
of its signed integer type, no overflow can occur at this location. 
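
The overflow check described above can be sketched with Python's
arbitrary-precision integers. The helper names and the choice of a 32-bit
signed type are assumptions for this illustration; the actual analysis is part
of GOBLINT's value domain.

```python
from typing import Tuple

Interval = Tuple[int, int]  # closed interval [lo, hi] over unbounded integers


def signed_range(bits: int) -> Interval:
    """Value range of a two's-complement signed integer type."""
    return (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)


def add(a: Interval, b: Interval) -> Interval:
    """Interval addition; the bounds never wrap around."""
    return (a[0] + b[0], a[1] + b[1])


def may_overflow(value: Interval, bits: int = 32) -> bool:
    """True unless the interval lies completely inside the type's range."""
    lo, hi = signed_range(bits)
    return value[0] < lo or value[1] > hi


small = add((0, 100), (0, 100))         # stays far away from the bounds
large = add((0, 2**31 - 1), (1, 1))     # upper bound exceeds INT_MAX
print(may_overflow(small), may_overflow(large))
```

Because the bounds are computed without wrap-around, an interval that leaves
the type's range is detected directly.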


No Data Race. The main goal of GOBLINT is data race detection and its anal- 
yses have been optimized for this purpose. Mutexes may be handled both path- 
sensitively and symbolically. Memory accesses are partitioned (e.g., by heap re- 
gion [9]), while locking expressions and access expressions are correlated using 
address equalities (e.g., a domain of affine and Herbrand equalities [10]) in order 
to analyze more sophisticated locking patterns [11]. 


2 Software Architecture 


GOBLINT is implemented in OCAML and uses an updated fork of CIL [6] as 
its parser frontend for the C language. Since the latter requires preprocessed 
code, GCC is executed for preprocessing the input, although this step should be 
unnecessary on the SV-COMP benchmarks. No other major libraries or external 
tools are required. 

The architecture of GOBLINT [2] is designed to be modular. Analyses, which 
are defined by their abstract domains and transfer functions, can be activated via 
runtime configuration options. A flexible query system allows for communication 
between analyses. Together, the combined analyses and the control-flow graphs 
of the functions in the program provide the side-effecting constraint system, 
which is solved by some local generic solver. While a number of solvers are 
available, the improved top-down solver TD3 [8] was employed for SV-COMP 
2021. Post-processing the solution yields results for the analysis. 


3 Strengths and Weaknesses 


Due to over-approximation, abstract interpretation as employed by GOBLINT 
can only determine whether the correctness specification must hold or may be 
violated, but not whether a concrete violating execution exists. Therefore, to 
avoid a large number of false alarms due to imprecision in SV-COMP, GOBLINT 
only reports results “true” and “unknown” respectively. This is a clear limitation 
of our approach, as all competing tools do report definite violations. The strength 
of our approach, on the other hand, is that it aims to be sound by design (up 
to out-of-scope features of the input program as, e.g., inline assembler). This is 
evidenced by the fact that GOBLINT does not produce any incorrect results in
the competition. 

GOBLINT performs best in the SoftwareSystems and ReachSafety-Product- 
Lines categories that consist of larger real-world programs, for which our ap- 
proach is well suited. On the downside, our verifier performs poorly in reacha- 
bility safety categories that contain smaller programs with intricate correctness 
conditions which our abstract domains cannot express. 

Even though the support for checking overflows is very new in GOBLINT, it 
has some success in the NoOverflows category. Unfortunately, the tool has no 
success in SoftwareSystems-*-NoOverflows. 

Although GOBLINT specializes in concurrency, it performs quite poorly in 
the ConcurrencySafety category. We believe this is because most benchmarks in 
the category require rather precise analysis of thread interleavings, which is not 
done in our thread-modular approach. 

As GOBLINT has been optimized for data race detection, it unsurprisingly 
performs better in the NoDataRace demo category. It must be noted that the 
majority of benchmarks in the category were submitted from our own test suite, 
consisting of racy and race-free programs. 

While the analyses can be fine-tuned via configuration options, the parame- 
ters are static and do not currently depend on the property nor the input pro- 
gram. A more granular and dynamic configuration system would allow increased 
precision, by enabling more expensive analyses where necessary, or decreased 
resource usage, by disabling unnecessary analyses, e.g., concurrency analyses on 
single-threaded programs. Furthermore, integrating counterexample-guided ab- 
straction refinement (CEGAR) into our framework might allow GOBLINT to also 
report violations, while avoiding false alarms and gaining more precision. 


4 Tool Setup and Configuration 


GOBLINT version svcomp21-0-g82e03b87 participated in SV-COMP 2021 [4,7]. 
It is available in both binary (Ubuntu 20.04) and source code form at our GitHub
repository under the svcomp21 tag. The only runtime dependency is GCC. 
Instructions for building from source can be found in the README. 

Both the tool-info module and the benchmark definition for SV-COMP are 
named goblint. They correspond to running the tool as follows: 


./goblint --conf conf/svcomp21.json --sets ana.specification 
property.prp input.c 


GOBLINT participated in the following categories: ReachSafety, Concurrency- 
Safety, NoOverflows, SoftwareSystems (while opting-out from SoftwareSystems- 
*_MemSafety) and NoDataRace (demo category). 


3 https://github.com/goblint/analyzer/releases/tag/svcomp21
5 Software Project and Contributors


GOBLINT development takes place on GitHub, while related publications are
listed on its website. It is an MIT-licensed joint project of the Technische
Universität München (Chair of Formal Languages, Compiler Construction, Software
Construction) and University of Tartu (Laboratory for Software Science).


Acknowledgements. This work was supported by Deutsche Forschungsgemeinschaft
(DFG), 378803395/2428 CONVEY, and the Estonian Research Council grant PSG61. We
would like to thank everyone who has contributed to GOBLINT over the years.


References 


1. Amato, G., Scozzari, F., Seidl, H., Apinis, K., Vojdani, V.: Efficiently
   intertwining widening and narrowing. Science of Computer Programming 120,
   1-24 (May 2016). DOI: 10.1016/j.scico.2015.12.005
2. Apinis, K.: Frameworks for analyzing multi-threaded C. Ph.D. thesis,
   Technische Universität München (2014)
3. Apinis, K., Seidl, H., Vojdani, V.: Side-Effecting Constraint Systems: A
   Swiss Army Knife for Program Analysis. In: APLAS ’12. pp. 157-172. Springer
   (2012). DOI: 10.1007/978-3-642-35182-2_12
4. Beyer, D.: Software Verification: 10th Comparative Evaluation (SV-COMP
   2021). In: Proc. TACAS (2). LNCS 12652, Springer (2021)
5. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for
   static analysis of programs by construction or approximation of fixpoints.
   In: POPL ’77. pp. 238-252 (1977). DOI: 10.1145/512950.512973
6. Necula, G.C., McPeak, S., Rahul, S.P., Weimer, W.: CIL: Intermediate language
   and tools for analysis and transformation of C programs. In: CC ’02.
   pp. 213-228. Springer (2002). DOI: 10.1007/3-540-45937-5_16
7. Saan, S., Schwarz, M., Apinis, K., Erhard, J., Seidl, H., Vogler, R.,
   Vojdani, V.: Goblint at SV-COMP 2021 (Dec 2020). DOI: 10.5281/zenodo.4485853
8. Seidl, H., Vogler, R.: Three improvements to the top-down solver. In:
   PPDP ’18. pp. 1-14 (2018). DOI: 10.1145/3236950.3236967
9. Seidl, H., Vojdani, V.: Region Analysis for Race Detection. In: SAS ’09.
   pp. 171-187. Springer (2009). DOI: 10.1007/978-3-642-03237-0_13
10. Seidl, H., Vojdani, V., Vene, V.: A Smooth Combination of Linear and
    Herbrand Equalities for Polynomial Time Must-Alias Analysis. In: FM ’09.
    pp. 644-659. Springer (2009). DOI: 10.1007/978-3-642-05089-3_41
11. Vojdani, V., Apinis, K., Rotov, V., Seidl, H., Vene, V., Vogler, R.: Static
    Race Detection for Device Drivers: The Goblint Approach. In: ASE 2016.
    pp. 391-402. ACM (2016). DOI: 10.1145/2970276.2970337


* https://github.com/goblint/analyzer 
5 https://goblint.in.tum.de 




Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the chapter’s 
Creative Commons license, unless indicated otherwise in a credit line to the material. If 
material is not included in the chapter’s Creative Commons license and your intended 
use is not permitted by statutory regulation or exceeds the permitted use, you will need 
to obtain permission directly from the copyright holder. 




Towards String Support in JayHorn 
(Competition Contribution) 


Ali Shamakhi¹, Hossein Hojjat¹,², and Philipp Rümmer³


1 University of Tehran, Tehran, Iran 
{ali.shamakhi,hojjat}@ut.ac.ir 
2 Tehran Institute for Advanced Studies, Tehran, Iran 
3 Uppsala University, Uppsala, Sweden 
philipp.ruemmer@it.uu.se 


Abstract. JayHorn is a Horn clause-based model checker for Java programs
that has been competing at SV-COMP since 2019. An ongoing research
and implementation effort is to add support for the String data type
to JayHorn. Since current Horn solvers do not support strings natively,
we consider a representation of (unbounded) strings using algebraic data- 
types, more precisely as lists. This paper discusses Horn clause encodings 
of different string operations, and presents preliminary results. 


1 The JayHorn Approach and Architecture 


We start by summarising the approach used in JayHorn, and refer to earlier pa- 
pers [5,6,7] for more details. JayHorn is a verification tool that encodes sequential 
Java programs as sets of Constrained Horn Clauses (CHCs) in order to check 
for possible assertion violations. The main CHC encoding in JayHorn is inspired 
by refinement types [2] and liquid types [8], and characterises programs in terms 
of method contracts, state invariants, and instance invariants of classes [5]. This 
encoding is over-approximate, and can prove absence of assertion violations. In 
order to find counterexamples, i.e., prove existence of violations, JayHorn also 
offers a bounded, under-approximate program encoding. 

JayHorn is entirely implemented in Java, and uses the Soot framework [10] 
to process Java bytecode, and the CHC solver Eldarica [3] to solve Horn clauses. 


2 Encoding of String Operations 


In this paper, we focus on the handling of Strings and their operations, a feature 
of Java that was not previously supported by JayHorn. Since JayHorn verifies 
programs without imposing bounds on the number of execution steps or the 
size of input data, our goal is to handle also unbounded strings. Unfortunately, 
while there has been significant progress in SMT solving for strings, current CHC 
solvers do not yet support strings natively. We therefore use recursive algebraic 
data types to model strings, and follow the approach proposed in [4]: strings are 
represented using lists, with a binary constructor cons and the constant nil. 

There are two ways to encode a string using cons and nil. The Left-to-Right
(LTR) encoding starts with the leftmost character of the string. For example,
"Jay" = cons(‘J’,cons(‘a’,cons(‘y’,nil))). The Right-to-Left (RTL) encoding
starts with the rightmost character. Each encoding has its own benefits and 
drawbacks in modeling various operations, an aspect we evaluate in this paper. 
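
The two encodings can be illustrated with a small Python sketch. It is not JayHorn code: nested pairs stand in for cons, None stands in for nil, and the helper names encode_ltr and encode_rtl are hypothetical.

```python
from typing import Optional, Tuple

# A list is either None (nil) or a pair (head, tail) built by cons.
StrList = Optional[Tuple[str, "StrList"]]


def cons(head: str, tail: StrList) -> StrList:
    """Binary list constructor, as in the algebraic data type."""
    return (head, tail)


def encode_ltr(s: str) -> StrList:
    """Left-to-right: the outermost cons holds the leftmost character."""
    result: StrList = None
    for ch in reversed(s):
        result = cons(ch, result)
    return result


def encode_rtl(s: str) -> StrList:
    """Right-to-left: the outermost cons holds the rightmost character."""
    result: StrList = None
    for ch in s:
        result = cons(ch, result)
    return result
```

Under RTL, appending a character is a single cons on the outside, while under LTR a single cons prepends a character; this is the asymmetry that makes each encoding convenient for different operations.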

Three different LTR encodings of the concatenation operation are described 
in [4], and equivalent RTL encodings are easy to define. Moving beyond concate- 
nation, in this paper we show models of some of the more involved operations. 


2.1 The CompareTo Operation 


The String.compareTo method in Java returns an integer, which is the difference
of the lengths of the two strings if one of them is a prefix of the other (e.g.,
"cat".compareTo("c") == 2), or otherwise the difference of the characters at the
leftmost index where the strings differ (e.g., "card".compareTo("cash") == -1,
since the characters at that index are ‘r’ and ‘s’, respectively).

The method is modeled using the predicate $P_{rec}$(left, right, comparison_result)
under LTR encoding, which allows us to recursively remove leftmost characters
from both strings to reach a state in which the comparison_result is known.


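\[
\begin{aligned}
P_{rec}(x, \mathit{nil}, \mathit{len}(x)) &\leftarrow \mathit{true} \\
P_{rec}(\mathit{nil}, y, -\mathit{len}(y)) &\leftarrow \mathit{true} \\
P_{rec}(x, x, 0) &\leftarrow \mathit{true} \\
P_{rec}(\mathit{cons}(j, x), \mathit{cons}(k, y), j - k) &\leftarrow j \neq k \\
P_{rec}(\mathit{cons}(h, x), \mathit{cons}(h, y), d) &\leftarrow P_{rec}(x, y, d)
\end{aligned}
\]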
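
Read top-down, the LTR clauses behave like a recursive function. The following Python sketch makes that reading concrete; it is an illustration rather than the JayHorn encoding, with nested pairs for cons, None for nil, and hypothetical helper names.

```python
def ltr(s):
    """Encode s left to right as nested (head, tail) pairs; None is nil."""
    out = None
    for ch in reversed(s):
        out = (ch, out)
    return out


def length(xs):
    """Number of characters in a cons list."""
    n = 0
    while xs is not None:
        n, xs = n + 1, xs[1]
    return n


def compare_to_ltr(x, y):
    """Evaluate the LTR compareTo clauses as a function."""
    if y is None:                    # P_rec(x, nil, len(x))
        return length(x)
    if x is None:                    # P_rec(nil, y, -len(y))
        return -length(y)
    (j, x_tail), (k, y_tail) = x, y
    if j != k:                       # leftmost differing characters found
        return ord(j) - ord(k)
    return compare_to_ltr(x_tail, y_tail)   # same head: drop it on both sides
```

The clause for identical strings needs no separate case here, since it is subsumed by the recursion reaching two nil lists.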


The predicate under RTL encoding needs an extra argument to keep track
of whether the comparison_result is based on a character difference or not, so the
predicate is $P'_{rec}$(left, right, comparison_result, char_diff). The clauses use the len
function to compute the length of a string, which is a built-in function in Eldarica.


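\[
\begin{aligned}
P'_{rec}(x, \mathit{nil}, \mathit{len}(x), \mathit{false}) &\leftarrow \mathit{true} \\
P'_{rec}(\mathit{nil}, y, -\mathit{len}(y), \mathit{false}) &\leftarrow \mathit{true} \\
P'_{rec}(x, x, 0, \mathit{false}) &\leftarrow \mathit{true} \\
P'_{rec}(\mathit{cons}(h, x), y, d + 1, \mathit{false}) &\leftarrow P'_{rec}(x, y, d, \mathit{false}) \land \mathit{len}(x) \geq \mathit{len}(y) \\
P'_{rec}(x, \mathit{cons}(h, y), d - 1, \mathit{false}) &\leftarrow P'_{rec}(x, y, d, \mathit{false}) \land \mathit{len}(x) \leq \mathit{len}(y) \\
P'_{rec}(\mathit{cons}(j, x), \mathit{cons}(k, x), j - k, \mathit{true}) &\leftarrow j \neq k \\
P'_{rec}(\mathit{cons}(h, x), y, d, \mathit{true}) &\leftarrow P'_{rec}(x, y, d, \mathit{true}) \\
P'_{rec}(x, \mathit{cons}(h, y), d, \mathit{true}) &\leftarrow P'_{rec}(x, y, d, \mathit{true})
\end{aligned}
\]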
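
The RTL clauses are relational, but one deterministic way of applying them can be sketched in Python. The strategy below (always peel the rightmost character of the longer string first) is an assumption made for illustration and is not taken from JayHorn; helper names are hypothetical.

```python
def rtl(s):
    """Encode s right to left: the outermost pair holds the rightmost char."""
    out = None
    for ch in s:
        out = (ch, out)
    return out


def length(xs):
    n = 0
    while xs is not None:
        n, xs = n + 1, xs[1]
    return n


def compare_to_rtl(x, y):
    """Return (comparison_result, char_diff) for RTL-encoded strings."""
    lx, ly = length(x), length(y)
    if lx > ly:
        # Drop the rightmost character of x; a length-based result grows by one.
        d, char_diff = compare_to_rtl(x[1], y)
        return (d, True) if char_diff else (d + 1, False)
    if lx < ly:
        # Drop the rightmost character of y; a length-based result shrinks by one.
        d, char_diff = compare_to_rtl(x, y[1])
        return (d, True) if char_diff else (d - 1, False)
    if x == y:
        return (0, False)
    (j, x_rest), (k, y_rest) = x, y
    if x_rest == y_rest:
        # The strings differ only in their rightmost character.
        return (ord(j) - ord(k), True)
    # Equal lengths with a difference further left: rightmost chars are irrelevant.
    return compare_to_rtl(x_rest, y_rest)
```

The char_diff flag is what prevents a character-based result from being adjusted by the length clauses on the way back up.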


2.2 Integer to String conversion 


The integer to string conversion relies on extracting digits one by one, which is
done using integer arithmetic. Under LTR encoding, during the conversion process,
the pre-condition stores the rest of the input after removing the digits converted
so far, starting from the lowest position. For example, if the number is
$i = d_{n-1} \cdots d_0$ and the converted string so far is $s = $ "$d_{k-1} \cdots d_0$", the rest of the
number will be $r = d_{n-1} \cdots d_k$, which is stored in the pre-condition.

The pre-condition in RTL encoding stores the offset of the next digit that 
needs to be extracted, since extracting digits from highest place values requires 
knowing their positions. 
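
The difference between the two directions can be sketched in Python for non-negative integers. The paper does not list the clauses for this operation, so the loop structure and the names below are assumptions; only the choice of loop state (remaining number versus digit offset) follows the description above.

```python
def int_to_ltr(i):
    """Non-negative i to an LTR list; r is the part of i not yet converted."""
    if i == 0:
        return ("0", None)
    out, r = None, i
    while r > 0:
        r, digit = divmod(r, 10)     # peel the lowest remaining digit
        out = (str(digit), out)      # prepend: it becomes the leftmost char
    return out


def int_to_rtl(i):
    """Non-negative i to an RTL list; offset is the place value of the next digit."""
    if i == 0:
        return ("0", None)
    offset = 1
    while offset * 10 <= i:
        offset *= 10                 # place value of the highest digit
    out = None
    while offset > 0:
        digit = (i // offset) % 10   # extract the digit at the current offset
        out = (str(digit), out)      # append: it becomes the rightmost char
        offset //= 10
    return out
```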


2.3 StartsWith and EndsWith 


The encoding of the String.startsWith method needs to consider different states
of both strings and their relation, which leads to multiple recursive relations.

For example, if x starts with y, we can prepend c to both strings under LTR
encoding (to get x’ and y’) and the condition holds on the resulting strings
(i.e. x’ starts with y’). For another example, if x does not start with y and
len(x) ≥ len(y), we can append c to x under RTL encoding (to get x’) and the
condition holds on the resulting string (i.e. x’ does not start with y).


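\[
\begin{array}{llcl}
 & S_{rec}(x, \mathit{nil}, \mathit{true}) & \leftarrow & \mathit{true} \\
 & S_{rec}(x, x, \mathit{true}) & \leftarrow & \mathit{true} \\
 & S_{rec}(\mathit{nil}, y, \mathit{false}) & \leftarrow & \mathit{len}(y) > 0 \\
 & S_{rec}(\mathit{cons}(j, x), \mathit{cons}(k, y), \mathit{false}) & \leftarrow & S_{rec}(x, y, \mathit{false}) \\
\text{(LTR)} & S_{rec}(\mathit{cons}(h, x), \mathit{cons}(h, y), \mathit{true}) & \leftarrow & S_{rec}(x, y, \mathit{true}) \\
\text{(LTR)} & S_{rec}(\mathit{cons}(j, x), \mathit{cons}(k, y), \mathit{false}) & \leftarrow & j \neq k \\
\text{(RTL)} & S_{rec}(\mathit{cons}(h, x), y, \mathit{true}) & \leftarrow & S_{rec}(x, y, \mathit{true}) \\
\text{(RTL)} & S_{rec}(\mathit{cons}(j, x), \mathit{cons}(k, x), \mathit{false}) & \leftarrow & j \neq k \\
\text{(RTL)} & S_{rec}(\mathit{cons}(h, x), y, \mathit{false}) & \leftarrow & S_{rec}(x, y, \mathit{false}) \land \mathit{len}(x) \geq \mathit{len}(y) \\
\text{(RTL)} & S_{rec}(x, \mathit{cons}(h, y), \mathit{false}) & \leftarrow & S_{rec}(x, y, \mathit{false})
\end{array}
\]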


The RTL encoding of endsWith is the same as LTR encoding of startsWith, 
and the LTR encoding of endsWith is the same as RTL encoding of startsWith. 
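
The duality between startsWith and endsWith can be checked with a short Python sketch of the LTR clauses. It is an illustration under the same assumptions as before (nested pairs for cons, None for nil, hypothetical helper names), not the JayHorn implementation.

```python
def ltr(s):
    out = None
    for ch in reversed(s):
        out = (ch, out)
    return out


def rtl(s):
    out = None
    for ch in s:
        out = (ch, out)
    return out


def starts_with(x, y):
    """Functional reading of the shared and LTR-tagged startsWith clauses."""
    if y is None:                     # S_rec(x, nil, true)
        return True
    if x is None:                     # S_rec(nil, y, false) when len(y) > 0
        return False
    (j, x_tail), (k, y_tail) = x, y
    if j != k:                        # (LTR) heads differ
        return False
    return starts_with(x_tail, y_tail)   # (LTR) equal heads: result is inherited
```

Applying the same function to RTL-encoded arguments answers endsWith, because the head of an RTL list is the last character of the string.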


2.4 CharAt 


The encoding definition of String.charAt relies on the fact that prepending 
a character to a string under LTR encoding increases indices of all previous 
characters by one, while appending a character to a string under RTL encoding 
does not change those indices. 


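\[
\begin{array}{llcl}
\text{(LTR)} & \mathit{ChAt}_{rec}(\mathit{cons}(h, t), 0, h) & \leftarrow & \mathit{true} \\
\text{(LTR)} & \mathit{ChAt}_{rec}(\mathit{cons}(h, t), i + 1, c) & \leftarrow & \mathit{ChAt}_{rec}(t, i, c) \land 0 \leq i < \mathit{len}(t) \\
\text{(RTL)} & \mathit{ChAt}_{rec}(\mathit{cons}(h, t), \mathit{len}(t), h) & \leftarrow & \mathit{true} \\
\text{(RTL)} & \mathit{ChAt}_{rec}(\mathit{cons}(h, t), i, c) & \leftarrow & \mathit{ChAt}_{rec}(t, i, c) \land 0 \leq i < \mathit{len}(t)
\end{array}
\]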
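
A functional reading of these clauses, sketched in Python, shows how the index is treated differently in the two directions. The helper names and the use of IndexError for out-of-range indices are assumptions made for the example.

```python
def ltr(s):
    out = None
    for ch in reversed(s):
        out = (ch, out)
    return out


def rtl(s):
    out = None
    for ch in s:
        out = (ch, out)
    return out


def length(xs):
    n = 0
    while xs is not None:
        n, xs = n + 1, xs[1]
    return n


def char_at_ltr(s, i):
    """LTR: the head has index 0; each removed head shifts indices down by one."""
    if s is None or i < 0:
        raise IndexError(i)
    head, tail = s
    return head if i == 0 else char_at_ltr(tail, i - 1)


def char_at_rtl(s, i):
    """RTL: the head has index len(tail); indices inside the tail are unchanged."""
    if s is None or i < 0:
        raise IndexError(i)
    head, tail = s
    return head if i == length(tail) else char_at_rtl(tail, i)
```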


3 Performance of the String Encoding 


The following table shows the results of JayHorn on the 53 problems in the SV- 
COMP Java track that involve strings. Many of the programs contain string 
operations that are not yet handled in JayHorn, but the results already make
it possible to compare encoding choices. Uniformly, RTL performs better than 
LTR (probably because appending characters to strings is more common than 
adding characters in the beginning), and the under-approximating CHC encod- 
ing of JayHorn performs better than the over-approximate encoding (probably 
because over-approximation too often loses information about string contents). 
The choice between Iterative, Recursive, or Recursive-with-precondition [4] for 
string concatenation surprisingly had no effect on the results. 


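Encoding      |       Iterative       |       Recursive       |   RecursiveWithPrec
choices       | U-Approx  | O-Approx  | U-Approx  | O-Approx  | U-Approx  | O-Approx
              | LTR | RTL | LTR | RTL | LTR | RTL | LTR | RTL | LTR | RTL | LTR | RTL
--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
# Solved      |  4  |  6  |  1  |  3  |  4  |  6  |  1  |  3  |  4  |  6  |  1  |  3
Avg. Time (s) | 81  | 79  | 7.5 | 16  | 79  | 78  | 7.6 | 16  | 77  | 78  | 7.7 | 16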


In other respects, JayHorn performed similarly in SV-COMP 2021 [1] as in
the two previous years. JayHorn gave one incorrect answer, for the problem
UnsatAddition02, due to the use of unbounded integer arithmetic instead
of the correct Java machine arithmetic semantics. JayHorn could correctly prove
125 benchmarks safe, and 151 benchmarks unsafe. Compared to 2020, JayHorn
now solves 59 of the 64 MinePump benchmarks (by encoding enums, see Section 4)
and 6 of the 53 string benchmarks.
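
The kind of discrepancy behind such an incorrect answer can be illustrated in Python, whose integers are unbounded. The example is generic and is not the UnsatAddition02 benchmark; the helper to_int32 is a hypothetical name for wrapping a value to a 32-bit two's-complement int as used by Java.

```python
def to_int32(n):
    """Wrap an unbounded integer to a 32-bit two's-complement value."""
    return (n + 2**31) % 2**32 - 2**31


INT_MAX = 2**31 - 1

unbounded_sum = INT_MAX + 1           # mathematical integers: still positive
machine_sum = to_int32(INT_MAX + 1)   # 32-bit machine arithmetic: wraps around

# An assertion such as "x + 1 > x" holds over unbounded integers
# but fails for x == INT_MAX under machine arithmetic.
holds_unbounded = unbounded_sum > INT_MAX
holds_machine = machine_sum > INT_MAX
```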

The biggest factor influencing the performance of JayHorn in SV-COMP is 
still the incomplete model of the Java API in JayHorn, given the large number 
of API tests among the SV-COMP Java benchmarks. Our work on supporting 
Strings, described in this paper, is one of the efforts to address the situation. 


4 Tool Setup 


The version submitted to SV-COMP 2021 is JayHorn version 0.7.5-strings,⁴
which is also available on Zenodo [9]. In the configuration used in the
competition,⁵ JayHorn only applies the Horn solver Eldarica. The Benchexec tool info
module is called jayhorn.py and the benchmark definition file jayhorn.xml.
JayHorn competes in the Java category.

Since JayHorn only has incomplete support for Java enums, this year we
added a small source transformation tool⁶ to JayHorn that has the purpose of
replacing enums with simple integer variables. The script used in the
competition applies the transformation tool to the benchmark source code prior to
compilation to bytecode.


4 https://github.com/jayhorn/jayhorn/releases/tag/v0.7.5-strings

5 Java options: -Xss40000k -Xmx12g
JayHorn options: -inline-size 50 -conservative -specs -string-encoding
recursiveWithPrec -string-direction rtl

6 https://github.com/jayhorn/jayhorn/tree/devel/enum-eliminator


5 Software Project and Contributors


JayHorn was initially developed by Temesghen Kahsai, Philipp Rümmer, and
Martin Schäf, with contributions by Daniel Dietsch, Rody Kersten, Huascar
Sanchez, and Valentin Wüstholz [6,7]. Further development of the tool is at the
moment mainly carried out by the authors of this paper. JayHorn is open source,
and distributed under the MIT license on https://github.com/jayhorn/jayhorn.


Abstract. JDART performs dynamic symbolic execution of JAVA programs:
it executes programs with concrete inputs while recording symbolic
constraints on executed program paths. A portfolio of constraint
solvers is then used for generating new concrete values from recorded
constraints that drive execution along previously unexplored paths. For
SV-COMP 2021, we improved JDART by implementing exploration strategies,
bounded analysis, and path-specific constraint solving strategies,
as well as by enabling the use of SMT-Lib string theory for encoding of
string operations.


1 Overview 


JDART is a dynamic symbolic execution engine for the JAVA virtual machine 
(JVM) built on top of Java PathFinder (JPF) [12]. We first entered SV-COMP 
2020 with JDART. Our corresponding report gives a short overview of JDART’s 
architecture and internals [9]. In this paper, we focus on the description of the fol- 
lowing three improvements that were explicitly motivated by SV-COMP 2021 [2]. 


1. The re-implementation of the internal constraints-tree enables bounded
analysis and exploration strategies (e.g., breadth-first search instead of
depth-first search).

2. A new CVC4 backend in JCONSTRAINTS is the basis for path-based selection 
of constraint solvers and sequential portfolio solving (using Z3 and CVC4). 

3. We integrate recent advances in string constraint solving [3,10] by modeling 
string operations as SMT-Lib string constraints instead of bit vectors. 


While all three changes contribute to an improved performance of JDART, port- 
folio solving has by far the biggest impact on the number of analyzed benchmark 
instances of SV-COMP 2021. In this paper, we focus on the description of the 
changes for (1) and (2). 


2 Tool Improvements for SV-COMP 2021 


JDART runs as an extension of the JPF software model checker [12], using the
JAVA virtual machine implemented by JPF and its capabilities for annotating
values on the stack and the heap with symbolic information. The tool itself
is written in JAVA and uses JCONSTRAINTS [6] for encoding SMT problems.
Moreover, JCONSTRAINTS acts as a frontend to the Z3 [5] or CVC4 [1] SMT
solver used for finding concrete values that drive the analysis.

Fig. 1: The architecture and call hierarchy in the constraint solving backend.


Exploration Strategies. JDART has two main components: the Executor and 
the Explorer. While the Executor runs the concrete analysis and records sym- 
bolic constraints during concrete execution, the Explorer is responsible for explo- 
ration strategies and management of constraints. We re-designed the central data 
structure of the Explorer, the constraints tree, for SV-COMP 2021: The new tree 
supports different exploration strategies (e.g., breadth-first search) and bounds 
on the depth of exploration. In the past, JDART relied on unbounded depth-first 
exploration which would often ‘get trapped’ unrolling unbounded loops or re- 
cursion. Breadth-first search prevents this behavior and is more effective on the 
SV-COMP benchmark set. 
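
The effect of a depth-bounded breadth-first strategy can be sketched in Python on a toy decision tree. This is not the JDART data structure: the function explore_bfs, the children callback, and the encoding of paths as strings of T/F decisions are assumptions made for the example.

```python
from collections import deque


def explore_bfs(root, children, max_depth):
    """Yield the nodes of a decision tree breadth-first, down to max_depth decisions."""
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        yield node
        if depth < max_depth:
            for child in children(node):
                queue.append((child, depth + 1))


def branch(path):
    """Toy tree: every decision has a true and a false successor (infinitely deep)."""
    return [path + "T", path + "F"]


visited = list(explore_bfs("", branch, max_depth=2))
```

Although the toy tree is infinite, the exploration terminates after the second level, whereas an unbounded depth-first traversal would follow the first branch forever.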


Portfolio-Solving. Figure 1 demonstrates the architecture of the constraint 
solving backend used by JDART and JCONSTRAINTS for SV-COMP; dashed 
components and control-flow have been added for SV-COMP 2021: The bounding 
solver (developed for SV-COMP 2020) calls subsequent solvers with successively 
weaker bounds on numeric variables. For SV-COMP 2021, we use upper bounds
2, 8, 13, 21, 200, 600, ∞ and symmetric negative lower bounds. The new
path-specific solver selects the most promising solving approach for every concrete
path constraint: Currently, constraints involving string operations, type casts,
or floating-point numbers are handed to the portfolio solver as we expect
better performance. The portfolio solver wraps the CVC4 solver, restarting
solving attempts in the case of (fairly frequent and random) segmentation faults
and invoking Z3 after a fixed timeout of 60 seconds. All other path
constraints are passed directly to the Z3 solver, as JDART did with all
constraints at SV-COMP 2020.
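
The idea of successively weaker bounds can be sketched in Python. The bound sequence is the one quoted above; everything else is an assumption for illustration, in particular the brute-force function solve_within, which stands in for an SMT solver call and simply gives up on the unbounded case.

```python
import math
from itertools import product

BOUNDS = [2, 8, 13, 21, 200, 600, math.inf]


def solve_within(constraint, n_vars, bound):
    """Toy stand-in for a solver call: try all integers in [-bound, bound]."""
    if math.isinf(bound):
        return None  # the toy solver cannot enumerate an unbounded domain
    values = range(-bound, bound + 1)
    for candidate in product(values, repeat=n_vars):
        if constraint(*candidate):
            return candidate
    return None


def bounding_solve(constraint, n_vars):
    """Return (bound, model) for the tightest bound that admits a model."""
    for bound in BOUNDS:
        model = solve_within(constraint, n_vars, bound)
        if model is not None:
            return bound, model
    return None


result = bounding_solve(lambda x, y: x + y == 20 and x > y, n_vars=2)
```

Trying tight bounds first tends to produce small concrete values, which a real solver is free to ignore when it is called without bounds.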


3 Strengths and Weaknesses 


JDART scored 623 points (max. of 693) in the JAVA track and was declared 
second winner for JAVA, after JAVA RANGER (630 points) [11]. Next best is 
JBMC [4] with 603 points. Like JAVA RANGER and JBMC, JDART did not report
a single incorrect verdict. JDART exhibits the general strengths and weaknesses
of dynamic and symbolic analysis approaches for JAVA programs:


Fast search for counterexamples. Driven by concrete execution, the analysis
is fairly fast. In cases where it can provide an answer, JDART (950 s) is overall
the second-fastest tool, after JBMC (650 s). Notably, JDART successfully found
counterexamples in 251 of 253 instances. The second-best tool in this respect is 
JBMC with 243 correct false verdicts. Of the two instances for which JDART 
did not produce counterexamples one uses the split operation for strings that 
JDART does not yet model, leading to an unknown result. For the other instance, 
stack unrolling triggers an out of memory exception during the concolic execution 
of one path through the recursive Ackermann function. 


Path Explosion. JDART is affected by path explosion in programs with long 
sequences of branching instructions with mutually unrelated conditions. Such 
sequences are common in code generated from models in the realm of embedded 
systems, e.g., by the Alarm benchmark instances in SV-COMP 2021. For these 
instances, JDART does not manage to explore all paths in the given time limit. 


Unbounded Behavior. Based on principles of symbolic execution, JDART will 
only terminate on unbounded loops or in case of unbounded recursion when us- 
ing manually configured bounds. In addition, the concolic execution might be 
configured to stop on property violations. As a consequence, assertion errors 
might be used as analysis bounds. For SV-COMP 2021, we used a search depth 
of 270 recorded decisions on paths in the constraints tree which we deemed con- 
servative after initial experiments on the benchmark set: While in 13 instances 
true verdicts were given after exploring exhaustively up to the depth bound, 
there remain 30 problem instances for which JDART timed out exploring the 
search space up to the depth bound and 6 instances raising unknown verdicts 
(including the two mentioned above). 


4 Tool Setup 


The source code of JDART used for the competition artifact [8] is available on
GitHub¹. JDART is designed as a plug-in for JPF and relies on ant as a build
system. One of its dependencies is the jpf-core project [12]. The other dependency
is the JCONSTRAINTS library, which was configured to use Z3 [5] and CVC4 [1] for
SV-COMP 2021. For the competition, JDART is wrapped by the run-jdart.sh
shell script which generates .jpf configuration files, specifying which benchmark
to analyze and the global configuration options of JDART. For SV-COMP 2021,
we choose termination on the first assertion error, a depth bound of 270 (deci- 
sions on paths in the constraints tree) for exploration, breadth first search as 
exploration strategy, and the described path-specific solver together with itera- 
tive weakening of bounds on values in models as described in Section 2. Z3 is 
configured to run with the sequence solver for strings. The shell script records 
and interprets the output of JDART and can also report the version of JDART. 


1 https://github.com/tudo-aqua/jdart, Commit 4a9cc43


5 Software Project


JDART, as used in SV-COMP 2021, is maintained by the Automated Quality
Assurance Group at TU Dortmund University (in particular by the authors of
this paper) and is available under the Apache License, version 2.0, on GitHub¹.
An initial version of JDART was developed by the authors of [7] at NASA Ames
Research Center and Carnegie Mellon University. The original version of JDART
is available on GitHub².


Abstract. SYMBIOTIC 8 extends the traditional combination of static
analyses, instrumentation, program slicing, and symbolic execution with 
one substantial novelty, namely a technique mixing symbolic execution 
with k-induction. This technique can prove the correctness of programs 
with possibly unbounded loops, which cannot be done by classic sym- 
bolic execution. SYMBIOTIC 8 delivers also several other improvements. 
In particular, we have modified our fork of the symbolic executor KLEE 
to support the comparison of symbolic pointers. Further, we have tuned 
the shape analysis tool PREDATOR (integrated already in SYMBIOTIC 7) 
to perform better on LLVM bitcode. We have also developed a light-weight 
analysis of relations between variables that can prove the absence of out- 
of-bound accesses to arrays. 


1 Verification Approach 


SYMBIOTIC is a program analysis framework that combines fast static analyses 
with code instrumentation and program slicing to speed up the code verification 
which is then performed by symbolic executor KLEE [3] (or, alternatively, by 
another supported verification tool). The main improvement in SYMBIOTIC 8 is 
a new verification technique combining symbolic execution with k-induction [8] 
that we call KindSE. 


Symbolic execution with k-induction (KindSE). KindSE applies the idea
of k-induction [8] to paths of the control flow graph. The approach can be roughly
described by the following three steps.


1. Set k to 1. Let P be the set of all paths in the control flow graph of length
k that end in an error location.

2. Use symbolic execution to execute every path π ∈ P. If the symbolic
execution says that π is infeasible, remove π from P. If π is feasible and it starts
in the initial location, report that the program is incorrect.

3. If P is empty, the control flow graph contains no feasible path of length k
(or more) leading to an error location and thus we report that the program
is correct. If P is not empty, we replace each path π ∈ P by paths of length
k + 1 that have π as their suffix, increase k by one, and go to step 2.


* This work has been supported by the Czech Science Foundation grant GA20-07487S.


To improve the performance, we further extended the algorithm to summarize 
loop iterations. If we process a program location that is a loop header, we start 
unwinding the loop backwards. We over-approximate the states that we get 
in every loop iteration to cover more than one iteration if possible. If we are 
successful, the summarized loop states form an inductive invariant, which can 
help to prove that no error location is reachable from the loop header in k steps. 
Our loop summarization does not handle nested loops (in this case we fall back
to the algorithm without loop summarization) or function calls. To lift the
latter restriction, we inline all procedures (if possible) before running KindSE.
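
The three basic steps of KindSE, without loop summarization, can be sketched in Python over an explicit control flow graph. This is not SLOWBEAST code: the function kind_se, the predecessor map, the max_k cut-off, and in particular the is_feasible callback (a stand-in for symbolically executing a path) are assumptions made for illustration.

```python
def kind_se(preds, init, error_locs, is_feasible, max_k=50):
    """Backward path extension from error locations, following steps 1 to 3."""
    # Step 1: all paths of length 1 (one edge) that end in an error location.
    paths = [(p, e) for e in error_locs for p in preds.get(e, [])]
    for _k in range(1, max_k + 1):
        # Step 2: drop infeasible paths; a feasible path from init is a bug.
        paths = [path for path in paths if is_feasible(path)]
        if any(path[0] == init for path in paths):
            return "incorrect"
        # Step 3: no feasible path left means no longer one can reach an error.
        if not paths:
            return "correct"
        paths = [(q,) + path for path in paths for q in preds.get(path[0], [])]
    return "unknown"


# Toy CFG: init -> loop, loop -> loop, loop -> exit, exit -> err.
PREDS = {"loop": ["init", "loop"], "exit": ["loop"], "err": ["exit"]}


def assertion_holds_after_loop(path):
    """Toy oracle: reaching err right after leaving the loop is infeasible."""
    return path[-3:] != ("loop", "exit", "err")


safe = kind_se(PREDS, "init", ["err"], assertion_holds_after_loop)
buggy = kind_se(PREDS, "init", ["err"], lambda path: True)
```

In the safe case the only path of length 2 is already infeasible, so the loop never has to be unrolled; this is the benefit over forward symbolic execution on unbounded loops.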

KindSE is implemented in our prototype tool SLOWBEAST [1] which we
integrated into SYMBIOTIC 8. The tool currently supports only the unreach-call
property. SLOWBEAST can also work as a standard symbolic executor (without
k-induction), but it is noticeably slower than KLEE and it has some limitations.
However, it supports symbolic floating point arithmetic, which KLEE does not.


Workflow of Symbiotic 8. As the first step, a given program is translated to
LLVM [6]. If the program contains a call to pthread_create, SYMBIOTIC returns
unknown as it cannot handle parallel programs. The rest of the workflow then
depends on the verified property, as indicated in Figure 1.

For the unreach-call property, we call the slicer to remove instructions that have
no influence on the property and run KLEE. If KLEE does not decide in 222
seconds, we run KindSE in SLOWBEAST. If it fails, we run KLEE again and if it
also fails, we run SLOWBEAST as a standard symbolic executor. If some tool says
that the specified call is unreachable, we return true with the trivial witness. If
we detect that the specified call is reachable, we try replaying the error path on
the unsliced program. If the replay confirms that the call is reachable, we return
false with the error witness generated from the replay.

[Figure: workflow diagram. The program passes through the plugins in LLVM; for
unreach-call the tool chain is KLEE (timeout 222s), SLOWBEAST (KindSE), KLEE,
SLOWBEAST (SE); a found error leads to error path replay on the unsliced program.]

Fig. 1. The workflow of SYMBIOTIC 8

For other properties, we instrument the program with the help of various
analyses. For example, when checking memory safety, we use PREDATOR [5],
DG [4], and a values-relations analysis to detect potentially unsafe instructions.
If PREDATOR says that all instructions are safe, we directly return true. Oth- 
erwise, we slice the program with respect to potentially unsafe instructions and 
call KLEE. The rest of the process is identical to the previous case. 
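
The fallback chain for unreach-call can be sketched in Python as a list of stages that are tried in order. The sketch is a simplification, not the SYMBIOTIC scripts: tools are modelled as hypothetical callables returning "true", "false", or "unknown", slicing and timeouts are omitted, and answering unknown after a failed replay is an assumption, since the text only states what happens when the replay succeeds.

```python
def verify_unreach_call(program, stages, replay):
    """Try each (name, tool) stage in order; confirm violations by replay."""
    for name, tool in stages:
        verdict = tool(program)
        if verdict == "true":
            return ("true", name)        # call unreachable: trivial witness
        if verdict == "false":
            # Assumption: an unconfirmed error path yields "unknown".
            return ("false", name) if replay(program) else ("unknown", name)
    return ("unknown", None)


# Hypothetical stand-ins for the four stages of the workflow.
stages = [
    ("KLEE (timeout 222s)", lambda program: "unknown"),
    ("SLOWBEAST (KindSE)", lambda program: "true"),
    ("KLEE", lambda program: "unknown"),
    ("SLOWBEAST (SE)", lambda program: "unknown"),
]

outcome = verify_unreach_call("example.c", stages, replay=lambda program: True)
```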


2 Software Architecture 


All components of SYMBIOTIC 8 use LLVM 10 [6]. Scripts that call and control 
the components according to a given configuration are written in Python. 
The instrumentation module is written in C++. In SYMBIOTIC 8, we have newly
integrated a values-relations analysis as a plugin into instrumentation. This
analysis is able to prove that some accesses into arrays are valid. We have also improved the LLVM
frontend of PREDATOR [5] to perform similarly well as the GCC frontend.
The program slicing module is written in C++ and is built around the library
DG [4]. This year, we sped up the slicer by using more efficient data structures in
pointer analysis and by using function summaries in data dependence analysis.
We use our own fork of KLEE [3] that differs from the upstream KLEE mainly
in using segment-offset pointer representation which allows for better handling of
symbolic pointers and symbolic-sized allocations. This year, we mended handling
of symbolic pointers and added support for comparison of symbolic addresses.
The tool SLOWBEAST [1] is written in Python. Both KLEE and SLOWBEAST use
Z3 [7] as the SMT solver.


3 Strengths and Weaknesses 


Symbolic execution may be very efficient in finding bugs but suffers from the path 
explosion problem which may prevent it from fully analyzing programs with high 
level of branching. We alleviate this problem by using program slicing. However, 
in the presence of unbounded loops or infinite execution paths, program slicing 
does not help unless it removes the unbounded computation from the program. 
Indeed, classical symbolic execution is unable to verify such programs at all. 

To address the inability of symbolic execution to verify unbounded programs,
we use KindSE. However, its implementation in SLOWBEAST is not yet fully
mature and it handles only a very restricted set of programs.


Results of Symbiotic 8 in SV-COMP 2021. SYMBIOTIC 8 won the MemSafety
and SoftwareSystems categories [2]. In the MemSafety category, we lost many
points in the new MemSafety-Juliet subcategory. These benchmarks contain
threads and SYMBIOTIC immediately answered unknown due to the syntactic
check mentioned in Section 1. However, most of these benchmarks actually do
not spawn any thread and thus SYMBIOTIC could analyze them. The victory in
the SoftwareSystems category is mainly due to the dominance on the new uthash
benchmarks.

This year, over 500 correct answers produced by SYMBIOTIC were not
confirmed. Some of these cases can be attributed to the fact that SYMBIOTIC
generates only trivial correctness witnesses. However, there are also unconfirmed
answers because of missing witnesses, which turned out to be caused by a bug in the
SLOWBEAST integration. Unfortunately, these include all 99 benchmarks that were
newly proved correct by KindSE, of which 85 were in the ReachSafety-Loops
subcategory. We also had many unconfirmed witnesses for non-termination
violation that still need to be investigated.

SYMBIOTIC had 16 incorrect answers: 14 incorrect true in the Termination
category and 2 incorrect false in ReachSafety-Floats. All of them were caused by
last-minute commits that were fixed shortly after the submission deadline.
Because of these mistakes, SYMBIOTIC ended up in 4th place instead of
2nd in the Termination category.

In the Overall meta-category, SYMBIOTIC took 4th place, as it has
every year since 2018.


4 Tool Setup and Project Contributors 


The archive is available at https://doi.org/10.5281/zenodo.4483882. Run SYM- 
BIOTIC as: 


bin/symbiotic --sv-comp --prp <prpfile> [--32] <source> 


The option --prp sets the verified property and --32 tells SYMBIOTIC to assume 
32-bit architecture (64-bit architecture is assumed by default). 


5 Software Project and Contributors 


SYMBIOTIC 8 for SV-COMP 2021 has been developed by Marek Chalupa, Tomáš
Jašek, Jan Novák, and Anna Řechtáčková under the supervision of Jan Strejček.
Veronika Šoková provided valuable help with adjusting PREDATOR
modifications. SYMBIOTIC is available under the MIT license. All the external components
that the tool uses are also available under open-source licenses that comply with 
SV-COMP’s policy for the reproduction of results. The source code of SYMBI- 
OTIC can be found at: 


https://github.com/staticafi/symbiotic


Abstract. VeriAbs is a strategy selection-based reachability verifier for C
programs. A suitable strategy is selected from a pre-defined set of strategies
by taking into account the syntax and semantics of the code to be verified.
This year we present VeriAbs version 1.4.1 in which a novel preprocessor to
strategy selection is introduced. The preprocessor checks for the feasibility of
performing a lightweight slicing of the input code using function call graph and
variable reference information. If the program is found to be sliceable,
sub-programs or slices are generated, and the known strategy selection
algorithm of VeriAbs is applied to each slice. The verification results of each slice
are then composed to derive that of the entire program. This compositional
verification has improved the scalability of VeriAbs and is presented in this paper.


1 Verification Approach 


VeriAbs is a C program verifier using a portfolio of twelve verification techniques [2]. 
These techniques are organized into four strategies as shown in Figure 1. Each of the 
strategies is defined such that it benefits verification of a specific type of programs. A 
program type is identified by a strategy selector based on the following code-structural 
and variable-data properties: (1) unstructured control flow, (2) loops with arrays, (3) 
short input ranges, and (4) numerical loops in code. The strategy selector looks for 
these properties in the given order and assigns a verification strategy to the code. For 
this it uses code-structure and interval analyses [2]. If the assigned strategy is unable
to verify the program, it exits, unless the program contains arrays. In that case it
selects the default strategy corresponding to numerical loops. We refer to [2,3]
for details on each verification technique implemented in VeriAbs.

The colored blocks in Figure 1 indicate the enhancements to the tool made this 
year and are explained next. The colored block with a dashed outline indicates 
that the component has been added for the first time in VeriAbs, and that with a 
solid outline indicates that a block that existed in older versions has been modified. 
The dashed arrows indicate information flow added this year. This information is 
the verification result of the respective strategy passed back to the slicer-analyzer 
explained in the next section. Besides these, there are changes in witness generation
strategies, which are also explained in the next section.
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Fig. 1. VeriAbs Architecture (S: Program Safe, F: Property Fails, U: Unknown) 


1.1 Tool Enhancements 


Slicer-Analyzer. It has the following responsibilities: (1) checking the sliceability
of input program $P$, (2) generating slices $P_1, P_2, \ldots, P_r$ if $P$ is sliceable, and (3)
computing the verification result $R$ of $P$. Accordingly, the slicer-analyzer comprises
three parts. The first part checks for sliceability. Let main be the entry function
of $P$. We define $P$ to be sliceable with respect to main if all distinct functions
$f_1, f_2, \ldots, f_r$ directly called from main are defined in $P$, and are independent of
each other. We define the functions called from main to be independent iff main is
non-recursive; main contains no loops or unstructured control flow [2]; there is no
transitive dependence (made up of control and data dependence) between calls to
$f_1, \ldots, f_r$ in main; no two functions in $f_1, f_2, \ldots, f_r$ transitively call the same
function; and, if $F(f_i)$ is the union of $f_i$ and the functions transitively called by $f_i$,
then no two sets in $F(f_1), F(f_2), \ldots, F(f_r)$ refer to the same global variable in the
program. That is, if $V(F(f_i))$ is the set of global variables referred to by functions in
$F(f_i)$, then

$$\forall m, n \mid 1 \le m \le r,\ 1 \le n \le r,\ m \ne n \Rightarrow V(F(f_m)) \cap V(F(f_n)) = \emptyset.$$

The call graph and referred-variable information are computed using call-trees and a
lightweight flow-insensitive pointer analysis.
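
As an illustration of the last condition only (pairwise disjoint sets of referred global variables), the following C sketch assumes that each set $V(F(f_i))$ has already been computed and encoded as a 64-bit mask, one bit per global variable. The encoding and the function name globals_pairwise_disjoint are assumptions made for this example; the other independence conditions (no recursion, loops, or unstructured control flow in main, no dependence between the calls, no shared callees) are not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* vars[i] encodes the set of globals referred to by one called function
   and everything it transitively calls: bit k is set iff global variable
   number k is referred to. Hypothetical encoding, at most 64 globals. */
bool globals_pairwise_disjoint(const uint64_t *vars, size_t r) {
    for (size_t m = 0; m < r; m++) {
        for (size_t n = m + 1; n < r; n++) {
            /* A common bit means a shared global, so the sets intersect. */
            if ((vars[m] & vars[n]) != 0) {
                return false;
            }
        }
    }
    return true;
}
```

Checking only pairs with m < n is enough, because set intersection is symmetric.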

If the above conditions are satisfied then, using concepts presented in [10], the body
of main is sliced with respect to the call(s) to $f_i$ to create the entry function main'
of the executable slice $P_i$. Since main is sliced with respect to calls to $f_i$, $P_i$ will
only have functions in $F(f_i)$ and main'. That is, the set of functions in slice $P_i$ is
given by $F(f_i) \cup \mathit{main}'$. In this way the set of all slices is generated by the second
part of the slicer-analyzer.

The proposed technique of slicing has the potential to greatly reduce the state space
of the input program. This hypothesis is supported by experimental results presented
later. The proposed slicing function uses control- and data-flow information local to
main, hence it is lightweight.

    void main() {
      if(!a) f1();
      else if(b) f2();
    }
    f1(){...}
    f2(){b=0;assert(b);}

Fig. 2. Input Code

Consider the example in Figure 2. One slice from this code is given in Figure 3.
As seen, function main has been sliced with respect to the call to f2 in Figure 3
which contains the error. Function f1 need not be analyzed to find the error. This
type of slicing is helpful in analyzing large code in which the verifier may run out of
resources while analyzing an irrelevant function like f1.


    void main() {
      if(!a) ;
      else if(b) f2();
    }
    f2(){b=0;assert(b);}

Fig. 3. One Slice

Next, VeriAbs applies its strategy selection to each slice $P_i$, $\forall i$, $1 \le i \le r$,
sequentially. The results of the slices are composed to compute $R$, the verification
result of $P$, by the third part of the slicer-analyzer as follows: if an error trace is
realized for any slice, then $R$ is set to failure; if all slices are proved to be safe, then
$R$ is set to safe; otherwise, if none of the slices is found to be erroneous and there
exists a slice that could not be verified, then $R$ is set to unknown.
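
The composition rule can be sketched in C as follows. The Result type and the function compose_results are hypothetical names introduced for illustration. The sketch takes the results of all slices at once, whereas the text says the slices are verified sequentially.

```c
#include <stddef.h>

/* Verification outcome of one slice, or of the whole program. */
typedef enum { RESULT_SAFE, RESULT_FAILURE, RESULT_UNKNOWN } Result;

/* Compose per-slice results into the result R of the whole program:
   any failure gives failure; all safe gives safe; otherwise unknown. */
Result compose_results(const Result *slice, size_t r) {
    int saw_unknown = 0;
    for (size_t i = 0; i < r; i++) {
        if (slice[i] == RESULT_FAILURE) {
            return RESULT_FAILURE;   /* an error trace in any slice decides R */
        }
        if (slice[i] == RESULT_UNKNOWN) {
            saw_unknown = 1;         /* remember that some slice was not verified */
        }
    }
    return saw_unknown ? RESULT_UNKNOWN : RESULT_SAFE;
}
```

A failure takes precedence over an unknown result, so a single error trace is enough to report failure even when other slices could not be verified. For r = 0 the sketch returns RESULT_SAFE, a case the text does not discuss.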


This idea of slicing based on function-call and variable-reference information is
proposed here for the first time. It is similar to the concept of clustering presented in [12].
Both techniques partition a given application into independently executable
slices. But [12] forms clusters with respect to uncalled functions in the code base.
The proposed sliceability criterion, on the other hand, focuses only on functions called
from a given (entry) function main. It uses control- and data-flow analyses local
only to the given function to slice it with respect to calls in its body. This in turn
removes all functions not called from main. Another technique generates multiple
backward slices at every calling context with respect to a property to be verified [8].
The proposed slicing technique, however, produces slices with respect to functions
defined in $P$ and called from main.

Witness Generation From Slices: VeriAbs stores slices in the form of separate 
C programs. To generate a valid witness from a slice it is critical to report the 
correct line numbers in the witness [5]. The slicer-analyzer maintains correct line 
numbers in the slice with respect to the original code by adding #line directives to 
it. The directives are added at every point in the slice which reads values from the 
environment, starts a block of code, or contains a branching condition. The witness 
generated from such a slice in VeriAbs is valid with respect to the original program. 
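
The effect of a #line directive can be seen in the following small C sketch. The file name original.c, the line number 7, and the function name f2_reported_line are hypothetical and chosen only for this example; the sketch shows the standard preprocessor mechanism, not the actual slices emitted by VeriAbs.

```c
/* Hypothetical slice file. Suppose the kept function starts on line 7 of
   original.c; in the slice it would otherwise get a different line number. */
int b;

/* The line that follows this directive is numbered 7 of "original.c". */
#line 7 "original.c"
int f2_reported_line(void) { b = 0; return __LINE__; }
```

Here f2_reported_line() yields the line number from the original file, so tools that record source locations report positions of the original program even though the slice is stored as a separate C file.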

Experimental results: The proposed slicing led to VeriAbs successfully analyzing
120 additional programs in ReachSafety in SV-COMP’21. On the other hand, it now runs
out of time while verifying eighteen programs that it could successfully analyze earlier.
This is due to the additional time required to slice. Overall, these values demonstrate
the feasibility of this approach.

Next we present modifications made to existing components of VeriAbs. 

Strategy 1: Unstructured Control Flow. The first strategy, meant for programs
with unstructured control flow, has thus far executed two verification techniques in
parallel: an evolutionary test generation algorithm using grey-box fuzzing [13], and
k-induction with continuously refined invariants [6]. This year we do not use the
first algorithm in strategy 1, because the time it takes to generate useful error traces
is very large. We observe that as the program
complexity increases with the number of constraints, branching conditions, and/or
non-determinism, so does the time to reach the error by the test evolution algorithm.
As a result, the algorithm offers no apparent advantage when applied in
parallel with k-induction. We present our experimental observations of the given
algorithm in [2]. On the other hand, not using this algorithm led to time savings
and verification of a few additional programs. We continue to use this algorithm for
non-reactive loops and for programs with inputs of short ranges (strategy 3) [2]. Here
we allocate it an independent thread with no time limits, while results are obtained
quickly for non-reactive loops.

Witness Generation. This year VeriAbs uses the same strategies as last year to 
generate violation witnesses [3]. For correctness witnesses VeriAbs derives invariants 
from the over-approximation techniques in its portfolio. To save time this year VeriAbs 
does not extract invariants from k-induction [6] and interpolation [11] to generate 
correctness witnesses. From amongst the impacted witnesses, this led to 12 fewer 
witnesses being validated than last year. 


2 Software Architecture 


VeriAbs uses Vajra to perform full program induction [7], American Fuzzy Lop [13] 
to perform test evolution with fuzzing, and CPAchecker v1.8 [6] in the first strategy 
for k-induction. For bounded model checking VeriAbs uses the C Bounded Model 
Checker (CBMC) v5.10 [9] with the Glucose Syrup SAT solver v4.0 [4]. All remaining 
program analyses are implemented in the TCS Research group’s program analysis 
framework called Prism [12]. The slicer-analyzer and the strategy selector are partly
implemented in Perl.


3 Strengths and Weaknesses 


The main strengths of VeriAbs lie in its (1) portfolio of sound verification techniques, 
and its ability to (2) perform a lightweight slicing, (3) classify programs based on 
structural and variable data properties of code, and (4) match these code properties 
with suitable verification techniques. The main weakness of VeriAbs lies in its lack of 
an integrated implementation of witness generation that can utilize invariants derived 
across all strategies or techniques. This is because the invariants are to be derived
from various abstractions, some of which are generated by off-the-shelf tools, and are
not yet extracted.


4 Tool Setup and Configuration 


The VeriAbs SV-COMP 2021 executable is available for download at
https://gitlab.com/sosy-lab/sv-comp/archives-2021/-/tree/master/2021/veriabs.zip.
To install the tool, download the archive, extract its contents, and then follow the
installation instructions in VeriAbs/INSTALL.txt. To execute VeriAbs, the user needs
to specify the property file using the --property-file option. The witness is generated
in the current working directory as witness.graphml. VeriAbs participated in the
ReachSafety category of SV-COMP 2021. The BenchExec wrapper script for the tool is
veriabs.py and the benchmark description file is veriabs.xml. A sample command is as
follows:

    VeriAbs/scripts/veriabs --property-file reach-safety.prp a.c


5 Software Project and Contributors 


A few members of the Foundations of Computing group at TCS Research [1] maintain
VeriAbs. They can be contacted at veriabs.tool@tcs.com. We thank the past developers of
VeriAbs and the creators of Prism [12], Vajra, CPAchecker, and CBMC. We especially thank
Bharti Chimdyalwar, Shrawan Kumar and Ulka Shrotri for their insightful reviews.